R for applied epidemiology and public health

This handbook strives to:

  • Serve as a quick R code reference manual
  • Provide task-centered examples addressing common epidemiological problems
  • Assist epidemiologists transitioning to R
  • Be accessible in settings with low internet connectivity via an offline version


 

Written by epidemiologists, for epidemiologists

We are applied epis from around the world, writing in our spare time to offer this resource to the community. Your encouragement and feedback are most welcome.

How to use this handbook

  • Browse the pages in the Table of Contents, or use the search box
  • Click the “copy” icons to copy code
  • You can follow along with the example data
  • See the “Resources” section of each page for further material

Offline version

See instructions in the Download handbook and data page.

Languages

We want to translate this into languages other than English. If you can help, please contact us.

Acknowledgements

This handbook is produced by a collaboration of epidemiologists from around the world drawing upon experience with organizations including local, state, provincial, and national health agencies, the World Health Organization (WHO), Médecins Sans Frontières / Doctors without Borders (MSF), hospital systems, and academic institutions.

This handbook is not an approved product of any specific organization. Although we strive for accuracy, we provide no guarantee of the content in this book.

Contributors

Editor: Neale Batra

Authors: Neale Batra, Alex Spina, Paula Blomquist, Finlay Campbell, Henry Laurenson-Schafer, Isaac Florence, Natalie Fischer, Aminata Ndiaye, Liza Coyer, Jonathan Polonsky, Yurie Izawa, Chris Bailey, Daniel Molling, Isha Berry, Emma Buajitti, Mathilde Mousset, Sara Hollis, Wen Lin

Reviewers and supporters: Pat Keating, Amrish Baidjoe, Annick Lenglet, Margot Charette, Danielly Xavier, Marie-Amélie Degail Chabrat, Esther Kukielka, Michelle Sloan, Aybüke Koyuncu, Rachel Burke, Kate Kelsey, Berhe Etsay, John Rossow, Mackenzie Zendt, James Wright, Laura Haskins, Flavio Finger, Tim Taylor, Jae Hyoung Tim Lee, Brianna Bradley, Wayne Enanoria, Manual Albela Miranda, Molly Mantus, Pattama Ulrich, Joseph Timothy, Adam Vaughan, Olivia Varsaneux, Lionel Monteiro, Joao Muianga

Illustrations: Calder Fong

Funding and support

The handbook received supportive funding via a COVID-19 emergency capacity-building grant from TEPHINET, the global network of Field Epidemiology Training Programs (FETPs).

Administrative support was provided by the EPIET Alumni Network (EAN), with special thanks to Annika Wendland. EPIET is the European Programme for Intervention Epidemiology Training.

Special thanks to Médecins Sans Frontières (MSF) Operational Centre Amsterdam (OCA) for their support during the development of this handbook.

This publication was supported by Cooperative Agreement number NU2GGH001873, funded by the Centers for Disease Control and Prevention through TEPHINET, a program of The Task Force for Global Health. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention, the Department of Health and Human Services, The Task Force for Global Health, Inc. or TEPHINET.

Inspiration

The multitude of tutorials and vignettes that provided knowledge for development of handbook content are credited within their respective pages.

More generally, the following sources provided inspiration for this handbook:
  • The “R4Epis” project (a collaboration between MSF and RECON)
  • R Epidemics Consortium (RECON)
  • R for Data Science book (R4DS)
  • bookdown: Authoring Books and Technical Documents with R Markdown
  • Netlify hosts this website

Terms of Use and Contribution

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Academic courses and epidemiologist training programs are welcome to use this handbook with their students. If you have questions about your intended use, please contact us by email.

Citation

Batra, Neale, et al. The Epidemiologist R Handbook. 2021. DOI

Contribution

If you would like to make a content contribution, please contact us first via GitHub issues or by email. We are implementing a schedule for updates and are creating a contributor guide.

Please note that the epiRhandbook project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

About this book

1 Editorial and technical notes

In this page we describe the philosophical approach, style, and specific editorial decisions made during the creation of this handbook.

1.1 Approach and style

The potential audience for this book is large. It will surely be used by people very new to R, and also by experienced R users looking for best practices and tips. So it must be both accessible and succinct. Therefore, our approach was to provide just enough text explanation that someone very new to R can apply the code and follow what the code is doing.

A few other points:

  • This is a code reference book accompanied by relatively brief examples - not a thorough textbook on R or data science
  • This is an R handbook for use within applied epidemiology - not a manual on the methods or science of applied epidemiology
  • This is intended to be a living document - optimal R packages for a given task change often and we welcome discussion about which to emphasize in this handbook

R packages

So many choices

One of the most challenging aspects of learning R is knowing which R package to use for a given task. It is common to struggle through a task only to realize later - hey, there’s an R package that does all that in one line of code!

In this handbook, we try to offer you at least two ways to complete each task: one tried-and-true method (probably in base R or tidyverse) and one special R package that is custom-built for that purpose. We want you to have a couple of options in case you can’t download a given package or it otherwise does not work for you.

In choosing which packages to use, we prioritized R packages and approaches that have been tested and vetted by the community, that minimize the number of packages used in a typical work session, that are stable (not changing very often), and that accomplish the task simply and cleanly.

This handbook generally prioritizes R packages and functions from the tidyverse. Tidyverse is a collection of R packages designed for data science that share underlying grammar and data structures. All tidyverse packages can be installed or loaded via the tidyverse package. Read more at the tidyverse website.
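As a minimal sketch of what "installed or loaded via the tidyverse package" looks like in practice, the lines below attach the core tidyverse packages in one call. The install step is commented out because it needs an internet connection and only has to be run once.

```r
# One-time installation (requires internet); uncomment if tidyverse is not yet installed
# install.packages("tidyverse")

# Attach the core tidyverse packages (dplyr, ggplot2, tidyr, stringr, forcats, ...)
library(tidyverse)

# List every package that belongs to the tidyverse collection
tidyverse_packages()
```

Packages in the collection that are not attached by `library(tidyverse)` can still be loaded individually with their own `library()` call.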

When applicable, we also offer code options using base R - the packages and functions that come with R at installation. This is because we recognize that some of this book’s audience may not have reliable internet to download extra packages.
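To illustrate what a base R option can look like, here is a small sketch that summarises rows by group using only functions that ship with R. The `linelist` data frame and its columns are made-up toy data for illustration, not the handbook's example dataset.

```r
# Hypothetical toy data (not the handbook's linelist)
linelist <- data.frame(
  hospital = c("Central", "Central", "Port", "Port", "Port"),
  age      = c(20, 30, 40, 50, 60)
)

# Number of rows per hospital, using stats::aggregate() (part of base R)
cases_by_hospital <- aggregate(age ~ hospital, data = linelist, FUN = length)
names(cases_by_hospital)[2] <- "n_cases"   # rename the count column

# Mean age per hospital
mean_age_by_hospital <- aggregate(age ~ hospital, data = linelist, FUN = mean)

cases_by_hospital
mean_age_by_hospital
```

No extra packages are downloaded here, so the same code should work on a fresh R installation without internet access.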

Linking functions to packages explicitly

It is often frustrating in R tutorials when a function is shown in code, but you don’t know which package it is from! We try to avoid this situation.

In the narrative text, package names are written in bold (e.g. dplyr) and functions are written like this: mutate(). We strive to be explicit about which package a function comes from, either by referencing the package in nearby text or by specifying the package explicitly in the code like this: dplyr::mutate(). It may look redundant, but we are doing it on purpose.
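The sketch below contrasts the explicit `package::function()` form with the implicit form that relies on an earlier `library()` call. The data frame `df` and its column names are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data
df <- data.frame(age_years = c(2, 35, 71))

# Explicit: works even if dplyr has not been attached with library()
df_explicit <- dplyr::mutate(df, age_months = age_years * 12)

# Implicit: relies on dplyr having been attached first
library(dplyr)
df_implicit <- mutate(df, age_months = age_years * 12)

# Both calls run the same function and give the same result
identical(df_explicit, df_implicit)
```

The explicit form also removes ambiguity when two packages export a function with the same name, for example `dplyr::filter()` and `stats::filter()`.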

See the page on R basics to learn more about packages and functions.

Code style

In the handbook, we frequently utilize “new lines”, making our code appear “long”. We do this for a few reasons:

  • We can write explanatory comments with # that are adjacent to each little part of the code
  • Generally, longer (vertical) code is easier to read
  • It is easier to read on a narrow screen (no sideways scrolling needed)
  • From the indentations, it can be easier to know which arguments belong to which function

As a result, code that could be written like this:

linelist %>%
  group_by(hospital) %>%  # group rows by hospital
  slice_max(date, n = 1, with_ties = FALSE) # if there's a tie (of date), take the first row

…is written like this:

linelist %>%
  group_by(hospital) %>% # group rows by hospital
  slice_max(
    date,                # keep row per group with maximum date value
    n = 1,               # keep only the single highest row
    with_ties = FALSE)   # if there's a tie (of date), take the first row

R code is generally not affected by new lines or indentations. When writing code in RStudio, if you start a new line after a comma, the editor will apply automatic indentation.

We also use lots of spaces (e.g. n = 1 instead of n=1) because it is easier to read. Be kind to the people reading your code!
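As a rough check that line breaks and spacing do not change the result, the sketch below runs the compact and the long layout on a small made-up data frame and compares the outputs. The `linelist` defined here is hypothetical toy data, not the handbook's dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for the handbook's linelist
linelist <- data.frame(
  case_id  = 1:5,
  hospital = c("Central", "Central", "Port", "Port", "Port"),
  date     = as.Date(c("2021-01-03", "2021-01-09",
                       "2021-02-01", "2021-02-01", "2021-01-20"))
)

# Compact layout, minimal spacing
compact <- linelist %>% group_by(hospital) %>% slice_max(date,n=1,with_ties=FALSE)

# Long layout, one argument per line
long <- linelist %>%
  group_by(hospital) %>%   # group rows by hospital
  slice_max(
    date,                  # order rows within each group by date
    n = 1,                 # keep one row per group
    with_ties = FALSE)     # if dates tie, keep a single row

# Same rows and columns either way
identical(as.data.frame(compact), as.data.frame(long))
```

One caveat: a line break does matter when the line is already a complete expression. This is why the pipe `%>%` sits at the end of a line rather than at the start of the next one.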

Nomenclature

In this handbook, we generally reference “columns” and “rows” instead of “variables” and “observations”. As explained in this primer on “tidy data”, most epidemiological statistical datasets consist structurally of rows, columns, and values.

Variables contain the values that measure the same underlying attribute (like age group, outcome, or date of onset). Observations contain all values measured on the same unit (e.g. a person, site, or lab sample). These concepts are more abstract, so they can be harder to define tangibly.

In “tidy” datasets, each column is a variable, each row is an observation, and each cell is a single value. However, some datasets you encounter will not fit this mold - a “wide” format dataset may have a variable split across several columns (see an example in the Pivoting data page). Likewise, observations could be split across several rows.
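To make the "wide" case concrete, the sketch below builds a small made-up table in which one variable (case counts) is spread across three year columns, then reshapes it so that each row is one site-year observation. The data and column names are hypothetical. `pivot_longer()` comes from tidyr.

```r
library(tidyr)

# Hypothetical wide data: the variable "cases" is split across one column per year
wide <- data.frame(
  site       = c("A", "B"),
  cases_2019 = c(10, 4),
  cases_2020 = c(12, 7),
  cases_2021 = c(9, 11)
)

# Reshape to tidy (long) format: one row per site and year
long <- pivot_longer(
  wide,
  cols         = starts_with("cases_"),  # the columns holding the split variable
  names_to     = "year",                 # new column built from the old column names
  names_prefix = "cases_",               # strip this prefix from the year values
  values_to    = "cases"                 # new column holding the counts
)

long
```

The long version has three columns (`site`, `year`, `cases`) and six rows, so each column is now a single variable.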

Most of this handbook is about managing and transforming data, so referring to the concrete data structures of rows and columns is more relevant than the more abstract observations and variables. Exceptions occur primarily in pages on data analysis, where you will see more references to variables and observations.

Notes

Here are the types of notes you may encounter in the handbook:

NOTE: This is a note.
TIP: This is a tip.
CAUTION: This is a cautionary note.
DANGER: This is a warning.

1.2 Editorial decisions

Below, we track significant editorial decisions around package and function choice. If you disagree or want to offer a new tool for consideration, please join or start a conversation on our GitHub page.

Table of package, function, and other editorial decisions

| Subject | Considered | Outcome | Brief rationale |
|---|---|---|---|
| General coding approach | tidyverse, data.table, base | tidyverse, with a page on data.table, and mentions of base alternatives for readers with no internet | tidyverse readability, universality, most-taught |
| Package loading | library(), install.packages(), require(), pacman | pacman | Shortens and simplifies code for most multi-package install/load use-cases |
| Import and export | rio, many other packages | rio | Ease for many file types |
| Grouping for summary statistics | dplyr group_by(), stats aggregate() | dplyr group_by() | Consistent with tidyverse emphasis |
| Pivoting | tidyr (pivot functions), reshape2 (melt/cast), tidyr (spread/gather) | tidyr (pivot functions) | reshape2 is retired, tidyr uses pivot functions as of v1.0.0 |
| Clean column names | linelist, janitor | janitor | Consolidation of packages emphasized |
| Epiweeks | lubridate, aweek, tsibble, zoo | lubridate generally, the others for specific cases | lubridate’s flexibility, consistency, package maintenance prospects |
| ggplot labels | labs(), ggtitle()/ylab()/xlab() | labs() | all labels in one place, simplicity |
| Convert to factor | factor(), forcats | forcats | its various functions also convert to factor in same command |
| Epidemic curves | incidence, ggplot2, EpiCurve | incidence2 as quick, ggplot2 as detailed | dependability |
| Concatenation | paste(), paste0(), str_glue(), glue() | str_glue() | More simple syntax than paste functions; within stringr |
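A few of the outcomes in the table can be sketched together as follows. This is an illustrative example rather than code taken from the handbook pages: the values `onset`, `n_cases`, and `outcome` are made up, and the sketch assumes pacman is already installed.

```r
# Package loading: p_load() installs any missing package, then loads it
# (assumes pacman itself is installed)
pacman::p_load(stringr, forcats, lubridate)

# Hypothetical values for illustration
onset   <- as.Date("2021-05-10")
n_cases <- 12

# Concatenation: str_glue() embeds values inside the string...
msg_glue  <- str_glue("{n_cases} cases reported on {onset}")
# ...whereas paste0() stitches the pieces together
msg_paste <- paste0(n_cases, " cases reported on ", onset)

# Convert to factor: forcats converts the character vector and sets level order in one step
outcome <- fct_relevel(c("Recover", "Death", "Recover"), "Recover", "Death")
levels(outcome)

# Epiweeks: lubridate offers both the epidemiological and the ISO week number
week_epi <- epiweek(onset)
week_iso <- isoweek(onset)
```

Both concatenation approaches produce the same text here. The difference is that `str_glue()` keeps the sentence readable as a single string.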

1.3 Major revisions

| Date | Major changes |
|---|---|
| 10 May 2021 | Release of version 1.0.0 |

1.4 Session info (R, RStudio, packages)

Below is the information on the versions of R, RStudio, and R packages used during this rendering of the Handbook.

sessioninfo::session_info()
## - Session info ----------------------------------------------------------------------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 4.1.0 (2021-05-18)
##  os       Windows 10 x64              
##  system   x86_64, mingw32             
##  ui       RStudio                     
##  language (EN)                        
##  collate  English_United States.1252  
##  ctype    English_United States.1252  
##  tz       America/New_York            
##  date     2021-08-31                  
## 
## - Packages --------------------------------------------------------------------------------------------------------------------------------------------------------
##  ! package              * version     date       lib source                               
##    abind                * 1.4-5       2016-07-21 [1] CRAN (R 4.0.0)                       
##    ada                    2.0-5       2016-05-13 [1] CRAN (R 4.0.3)                       
##    adagio                 0.8.4       2021-04-30 [1] CRAN (R 4.1.0)                       
##    ade4                   1.7-16      2020-10-28 [1] CRAN (R 4.0.3)                       
##    anytime                0.3.9       2020-08-27 [1] CRAN (R 4.0.2)                       
##    ape                  * 5.5         2021-04-25 [1] CRAN (R 4.1.0)                       
##    aplot                  0.0.6       2020-09-03 [1] CRAN (R 4.0.3)                       
##    apyramid             * 0.1.2       2020-05-08 [1] CRAN (R 4.0.2)                       
##    assertive.base         0.0-9       2021-02-08 [1] CRAN (R 4.1.0)                       
##    assertive.properties   0.0-4       2016-12-30 [1] CRAN (R 4.1.0)                       
##    assertive.types        0.0-3       2016-12-30 [1] CRAN (R 4.1.0)                       
##    assertthat             0.2.1       2019-03-21 [1] CRAN (R 4.0.0)                       
##    aweek                * 1.0.2       2021-01-04 [1] CRAN (R 4.0.3)                       
##    backports              1.2.1       2020-12-09 [1] CRAN (R 4.0.3)                       
##    base64enc              0.1-3       2015-07-28 [1] CRAN (R 4.0.0)                       
##    bayestestR             0.10.0      2021-05-31 [1] CRAN (R 4.1.0)                       
##    BiocManager            1.30.15     2021-05-11 [1] CRAN (R 4.1.0)                       
##    bit                  * 4.0.4       2020-08-04 [1] CRAN (R 4.0.3)                       
##    bit64                  4.0.5       2020-08-30 [1] CRAN (R 4.0.3)                       
##    blob                   1.2.1       2020-01-20 [1] CRAN (R 4.0.2)                       
##    bookdown               0.22        2021-04-22 [1] CRAN (R 4.1.0)                       
##    boot                 * 1.3-28      2021-05-03 [1] CRAN (R 4.1.0)                       
##    broom                * 0.7.6       2021-04-05 [1] CRAN (R 4.0.5)                       
##    broom.helpers          1.3.0       2021-04-10 [1] CRAN (R 4.1.0)                       
##    bslib                  0.2.5.1     2021-05-18 [1] CRAN (R 4.0.5)                       
##    cachem                 1.0.5       2021-05-15 [1] CRAN (R 4.0.5)                       
##    callr                  3.7.0       2021-04-20 [1] CRAN (R 4.0.5)                       
##    car                    3.0-10      2020-09-29 [1] CRAN (R 4.0.3)                       
##    carData                3.0-4       2020-05-22 [1] CRAN (R 4.0.0)                       
##    cellranger             1.1.0       2016-07-27 [1] CRAN (R 4.0.0)                       
##    checkmate              2.0.0       2020-02-06 [1] CRAN (R 4.0.2)                       
##    class                  7.3-19      2021-05-03 [1] CRAN (R 4.1.0)                       
##    classInt               0.4-3       2020-04-07 [1] CRAN (R 4.0.0)                       
##    cli                    3.0.0       2021-06-30 [1] CRAN (R 4.1.0)                       
##    clock                  0.3.0       2021-04-22 [1] CRAN (R 4.0.5)                       
##    cmprsk                 2.2-10      2020-06-09 [1] CRAN (R 4.0.4)                       
##    coarseDataTools        0.6-5       2019-12-06 [1] CRAN (R 4.0.2)                       
##    coda                   0.19-4      2020-09-30 [1] CRAN (R 4.0.3)                       
##    codetools              0.2-18      2020-11-04 [1] CRAN (R 4.1.0)                       
##    colorspace             2.0-2       2021-06-24 [1] CRAN (R 4.1.0)                       
##    commonmark             1.7         2018-12-01 [1] CRAN (R 4.0.2)                       
##    conquer                1.0.2       2020-08-27 [1] CRAN (R 4.0.2)                       
##    corrr                * 0.4.3       2020-11-24 [1] CRAN (R 4.0.3)                       
##    cowplot              * 1.1.1       2020-12-30 [1] CRAN (R 4.0.3)                       
##    crayon                 1.4.1       2021-02-08 [1] CRAN (R 4.0.4)                       
##    crosstalk              1.1.1       2021-01-12 [1] CRAN (R 4.0.3)                       
##    curl                   4.3.2       2021-06-23 [1] CRAN (R 4.1.0)                       
##    curry                  0.1.1       2016-09-28 [1] CRAN (R 4.1.0)                       
##    data.table           * 1.14.0      2021-02-21 [1] CRAN (R 4.0.4)                       
##    DBI                  * 1.1.1       2021-01-15 [1] CRAN (R 4.0.3)                       
##    dbplyr                 2.1.1       2021-04-06 [1] CRAN (R 4.0.5)                       
##    deldir                 0.2-10      2021-02-16 [1] CRAN (R 4.0.4)                       
##    Deriv                  4.1.3       2021-02-24 [1] CRAN (R 4.0.5)                       
##    DiagrammeR           * 1.0.6.1     2020-05-08 [1] CRAN (R 4.0.3)                       
##    dichromat              2.0-0       2013-01-24 [1] CRAN (R 4.0.3)                       
##    digest                 0.6.27      2020-10-24 [1] CRAN (R 4.1.0)                       
##    distcrete            * 1.0.3       2017-11-23 [1] CRAN (R 4.0.2)                       
##    distributional         0.2.2       2021-02-02 [1] CRAN (R 4.0.3)                       
##    doBy                 * 4.6.10      2021-04-29 [1] CRAN (R 4.1.0)                       
##    doParallel             1.0.16      2020-10-16 [1] CRAN (R 4.0.3)                       
##    downlit                0.2.1       2020-11-04 [1] CRAN (R 4.0.3)                       
##    dplyr                * 1.0.6       2021-05-05 [1] CRAN (R 4.1.0)                       
##    dsr                  * 0.2.2       2019-08-23 [1] CRAN (R 4.0.2)                       
##    DT                   * 0.18        2021-04-14 [1] CRAN (R 4.1.0)                       
##    e1071                  1.7-7       2021-05-23 [1] CRAN (R 4.1.0)                       
##    ecmwfr               * 1.3.0       2020-07-13 [1] CRAN (R 4.0.3)                       
##    effectsize             0.4.5       2021-05-25 [1] CRAN (R 4.1.0)                       
##    ellipsis               0.3.2       2021-04-29 [1] CRAN (R 4.1.0)                       
##    Epi                  * 2.44        2021-02-27 [1] CRAN (R 4.0.4)                       
##    epibuffet              0.0.0.9005  2021-07-12 [1] Github (R4EPI/epibuffet@39cbc7e)     
##    epicontacts          * 1.2.0       2021-05-30 [1] Github (reconhub/epicontacts@facf491)
##    epidict                0.0.0.9001  2021-07-12 [1] Github (R4EPI/epidict@1893db0)       
##    EpiEstim             * 2.2-4       2021-01-07 [1] CRAN (R 4.0.3)                       
##    epikit               * 0.1.2       2021-04-29 [1] Github (R4EPI/epikit@9267b80)        
##    EpiNow2              * 1.3.2       2020-12-14 [1] CRAN (R 4.1.0)                       
##    epitrix              * 0.2.2       2019-01-15 [1] CRAN (R 4.0.2)                       
##    etm                    1.1.1       2020-09-08 [1] CRAN (R 4.0.4)                       
##    evaluate               0.14        2019-05-28 [1] CRAN (R 4.1.0)                       
##    evd                    2.3-3       2018-04-25 [1] CRAN (R 4.0.3)                       
##    expm                   0.999-6     2021-01-13 [1] CRAN (R 4.0.5)                       
##    fabletools           * 0.3.1       2021-03-16 [1] CRAN (R 4.0.5)                       
##    FactoClass             1.2.7       2018-10-01 [1] CRAN (R 4.0.3)                       
##    fansi                  0.5.0       2021-05-25 [1] CRAN (R 4.1.0)                       
##    farver                 2.1.0       2021-02-28 [1] CRAN (R 4.1.0)                       
##    fastLink             * 0.6.0       2020-04-29 [1] CRAN (R 4.1.0)                       
##    fastmap                1.1.0       2021-01-25 [1] CRAN (R 4.1.0)                       
##    feasts               * 0.2.1       2021-03-22 [1] CRAN (R 4.0.5)                       
##    ff                   * 4.0.4       2020-10-13 [1] CRAN (R 4.0.3)                       
##    fitdistrplus           1.1-5       2021-05-28 [1] CRAN (R 4.1.0)                       
##    flexdashboard        * 0.5.2       2020-06-24 [1] CRAN (R 4.0.5)                       
##    flextable            * 0.6.6       2021-05-17 [1] CRAN (R 4.1.0)                       
##    forcats              * 0.5.1       2021-01-27 [1] CRAN (R 4.1.0)                       
##    foreach                1.5.1       2020-10-15 [1] CRAN (R 4.0.3)                       
##    forecast             * 8.14        2021-03-11 [1] CRAN (R 4.1.0)                       
##    foreign                0.8-81      2020-12-22 [1] CRAN (R 4.1.0)                       
##    formatR                1.10        2021-05-25 [1] CRAN (R 4.1.0)                       
##    formattable          * 0.2.1       2021-01-07 [1] CRAN (R 4.0.3)                       
##    Formula              * 1.2-4       2020-10-16 [1] CRAN (R 4.0.3)                       
##    fracdiff               1.5-1       2020-01-24 [1] CRAN (R 4.0.2)                       
##    frailtypack          * 3.3.2       2020-10-14 [1] CRAN (R 4.1.0)                       
##    fs                   * 1.5.0       2020-07-31 [1] CRAN (R 4.1.0)                       
##    futile.logger          1.4.3       2016-07-10 [1] CRAN (R 4.0.3)                       
##    futile.options         1.0.1       2018-04-20 [1] CRAN (R 4.0.3)                       
##    future                 1.21.0      2020-12-10 [1] CRAN (R 4.0.3)                       
##    future.apply           1.7.0       2021-01-04 [1] CRAN (R 4.0.3)                       
##    gdata                  2.18.0      2017-06-06 [1] CRAN (R 4.0.5)                       
##    gdtools                0.2.3       2021-01-06 [1] CRAN (R 4.0.3)                       
##    generics               0.1.0       2020-10-31 [1] CRAN (R 4.1.0)                       
##    ggExtra              * 0.9         2019-08-27 [1] CRAN (R 4.0.4)                       
##    ggforce              * 0.3.3       2021-03-05 [1] CRAN (R 4.1.0)                       
##    gghighlight          * 0.3.1       2020-12-12 [1] CRAN (R 4.0.3)                       
##    ggnewscale           * 0.4.5       2021-01-11 [1] CRAN (R 4.0.3)                       
##    ggplot2              * 3.3.5       2021-06-25 [1] CRAN (R 4.1.0)                       
##    ggpubr               * 0.4.0       2020-06-27 [1] CRAN (R 4.1.0)                       
##    ggrepel              * 0.9.1       2021-01-15 [1] CRAN (R 4.0.3)                       
##    ggridges               0.5.3       2021-01-08 [1] CRAN (R 4.0.3)                       
##    ggsignif               0.6.1       2021-02-23 [1] CRAN (R 4.0.5)                       
##    ggtext                 0.1.1       2020-12-17 [1] CRAN (R 4.0.3)                       
##    ggtree               * 3.0.1       2021-05-25 [1] Bioconductor                         
##    ggupset              * 0.3.0       2020-05-05 [1] CRAN (R 4.0.2)                       
##    globals                0.14.0      2020-11-22 [1] CRAN (R 4.0.3)                       
##    glue                   1.4.2       2020-08-27 [1] CRAN (R 4.1.0)                       
##    gmodels                2.18.1      2018-06-25 [1] CRAN (R 4.0.5)                       
##    goftest                1.2-2       2019-12-02 [1] CRAN (R 4.0.3)                       
##    grates                 0.2.0       2021-05-28 [1] CRAN (R 4.1.0)                       
##    gridExtra              2.3         2017-09-09 [1] CRAN (R 4.1.0)                       
##    gridtext               0.1.4       2020-12-10 [1] CRAN (R 4.0.3)                       
##    gt                     0.3.0       2021-05-12 [1] CRAN (R 4.1.0)                       
##    gtable                 0.3.0       2019-03-25 [1] CRAN (R 4.1.0)                       
##    gtools                 3.8.2       2020-03-31 [1] CRAN (R 4.0.3)                       
##    gtsummary            * 1.4.1       2021-05-19 [1] CRAN (R 4.1.0)                       
##    haven                  2.4.1       2021-04-23 [1] CRAN (R 4.1.0)                       
##    here                 * 1.0.1       2020-12-13 [1] CRAN (R 4.0.3)                       
##    highcharter          * 0.8.2       2020-07-26 [1] CRAN (R 4.0.5)                       
##    highr                  0.9         2021-04-16 [1] CRAN (R 4.1.0)                       
##    hms                    1.1.0       2021-05-17 [1] CRAN (R 4.1.0)                       
##    htmltools              0.5.1.1     2021-01-22 [1] CRAN (R 4.1.0)                       
##    htmlwidgets            1.5.3       2020-12-10 [1] CRAN (R 4.1.0)                       
##    httpuv                 1.6.1       2021-05-07 [1] CRAN (R 4.1.0)                       
##    httr                   1.4.2       2020-07-20 [1] CRAN (R 4.1.0)                       
##    i2extras             * 0.1.0       2021-03-30 [1] CRAN (R 4.1.0)                       
##    igraph                 1.2.6       2020-10-06 [1] CRAN (R 4.1.0)                       
##    imputeTS             * 3.2         2021-01-16 [1] CRAN (R 4.1.0)                       
##    incidence              1.7.3       2020-11-04 [1] CRAN (R 4.0.3)                       
##    incidence2           * 1.1         2021-05-29 [1] CRAN (R 4.1.0)                       
##    inline                 0.3.19      2021-05-31 [1] CRAN (R 4.1.0)                       
##    insight                0.14.1      2021-05-28 [1] CRAN (R 4.1.0)                       
##    ipred                  0.9-11      2021-03-12 [1] CRAN (R 4.0.5)                       
##    isoband                0.2.4       2021-03-03 [1] CRAN (R 4.1.0)                       
##    iterators              1.0.13      2020-10-15 [1] CRAN (R 4.0.3)                       
##    janitor              * 2.1.0       2021-01-05 [1] CRAN (R 4.0.3)                       
##    jpeg                   0.1-8.1     2019-10-24 [1] CRAN (R 4.0.0)                       
##    jquerylib              0.1.4       2021-04-26 [1] CRAN (R 4.1.0)                       
##    jsonlite               1.7.2       2020-12-09 [1] CRAN (R 4.1.0)                       
##    kableExtra           * 1.3.4       2021-02-20 [1] CRAN (R 4.0.5)                       
##    KernSmooth             2.23-20     2021-05-03 [1] CRAN (R 4.1.0)                       
##    km.ci                  0.5-2       2009-08-30 [1] CRAN (R 4.0.4)                       
##    KMsurv                 0.1-5       2012-12-03 [1] CRAN (R 4.0.3)                       
##    knitr                  1.33        2021-04-24 [1] CRAN (R 4.1.0)                       
##    labeling               0.4.2       2020-10-20 [1] CRAN (R 4.1.0)                       
##    labelled               2.8.0       2021-03-08 [1] CRAN (R 4.1.0)                       
##    lambda.r               1.2.4       2019-09-18 [1] CRAN (R 4.0.3)                       
##    later                  1.2.0       2021-04-23 [1] CRAN (R 4.1.0)                       
##    lattice                0.20-44     2021-05-02 [1] CRAN (R 4.1.0)                       
##    lava                   1.6.9       2021-03-11 [1] CRAN (R 4.0.5)                       
##    lazyeval               0.2.2       2019-03-15 [1] CRAN (R 4.1.0)                       
##    leafem                 0.1.6       2021-05-24 [1] CRAN (R 4.1.0)                       
##    leaflet                2.0.4.1     2021-01-07 [1] CRAN (R 4.0.3)                       
##    leaflet.providers      1.9.0       2019-11-09 [1] CRAN (R 4.0.3)                       
##    leafsync               0.1.0       2019-03-05 [1] CRAN (R 4.0.3)                       
##    LearnBayes             2.15.1      2018-03-18 [1] CRAN (R 4.0.3)                       
##    lifecycle              1.0.0       2021-02-15 [1] CRAN (R 4.1.0)                       
##    linelist             * 0.0.40.9000 2020-09-18 [1] Github (reconhub/linelist@cae034d)   
##    listenv                0.8.0       2019-12-05 [1] CRAN (R 4.0.2)                       
##    lmtest               * 0.9-38      2020-09-09 [1] CRAN (R 4.0.2)                       
##    loo                    2.4.1       2020-12-09 [1] CRAN (R 4.0.3)                       
##    lpSolve                5.6.15      2020-01-24 [1] CRAN (R 4.1.0)                       
##    lubridate            * 1.7.10      2021-02-26 [1] CRAN (R 4.0.5)                       
##    lwgeom                 0.2-6       2021-04-02 [1] CRAN (R 4.0.5)                       
##    magrittr             * 2.0.1       2020-11-17 [1] CRAN (R 4.1.0)                       
##    markdown               1.1         2019-08-07 [1] CRAN (R 4.1.0)                       
##    MASS                 * 7.3-54      2021-05-03 [1] CRAN (R 4.1.0)                       
##    matchmaker             0.1.1       2020-02-21 [1] CRAN (R 4.0.2)                       
##    Matrix               * 1.3-4       2021-06-01 [1] CRAN (R 4.1.0)                       
##    MatrixModels           0.5-0       2021-03-02 [1] CRAN (R 4.0.5)                       
##    matrixStats            0.58.0      2021-01-29 [1] CRAN (R 4.1.0)                       
##    mcmc                   0.9-7       2020-03-21 [1] CRAN (R 4.0.2)                       
##    MCMCpack               1.5-0       2021-01-20 [1] CRAN (R 4.1.0)                       
##    memoise                2.0.0       2021-01-26 [1] CRAN (R 4.1.0)                       
##    mgcv                   1.8-36      2021-06-01 [1] CRAN (R 4.1.0)                       
##    mice                 * 3.13.0      2021-01-27 [1] CRAN (R 4.0.3)                       
##    microbenchmark         1.4-7       2019-09-24 [1] CRAN (R 4.1.0)                       
##    mime                   0.11        2021-06-23 [1] CRAN (R 4.1.0)                       
##    miniUI                 0.1.1.1     2018-05-18 [1] CRAN (R 4.0.4)                       
##    mitools                2.4         2019-04-26 [1] CRAN (R 4.0.0)                       
##    modelr                 0.1.8       2020-05-19 [1] CRAN (R 4.0.2)                       
##    munsell                0.5.0       2018-06-12 [1] CRAN (R 4.1.0)                       
##    naniar               * 0.6.1       2021-05-14 [1] CRAN (R 4.1.0)                       
##    networkD3            * 0.4         2017-03-18 [1] CRAN (R 4.1.0)                       
##    nlme                   3.1-152     2021-02-04 [1] CRAN (R 4.1.0)                       
##    nnet                   7.3-16      2021-05-03 [1] CRAN (R 4.1.0)                       
##    numDeriv               2016.8-1.1  2019-06-06 [1] CRAN (R 4.1.0)                       
##    officer              * 0.3.18      2021-04-02 [1] CRAN (R 4.1.0)                       
##    OpenStreetMap        * 0.3.4       2019-05-31 [1] CRAN (R 4.1.1)                       
##    openxlsx               4.2.3       2020-10-27 [1] CRAN (R 4.1.0)                       
##    pacman                 0.5.1       2019-03-11 [1] CRAN (R 4.1.0)                       
##    parallelly             1.25.0      2021-04-30 [1] CRAN (R 4.1.0)                       
##    parameters           * 0.14.0      2021-05-29 [1] CRAN (R 4.1.0)                       
##    patchwork            * 1.1.1       2020-12-17 [1] CRAN (R 4.1.0)                       
##    PerformanceAnalytics * 2.0.4       2020-02-06 [1] CRAN (R 4.1.0)                       
##    PHEindicatormethods  * 1.3.2       2020-06-25 [1] CRAN (R 4.1.0)                       
##    pillar                 1.6.1       2021-05-16 [1] CRAN (R 4.1.0)                       
##    pkgbuild               1.2.0       2020-12-15 [1] CRAN (R 4.1.0)                       
##    pkgconfig              2.0.3       2019-09-22 [1] CRAN (R 4.1.0)                       
##    plotly               * 4.9.3       2021-01-10 [1] CRAN (R 4.1.0)                       
##    plotrix                3.8-1       2021-01-21 [1] CRAN (R 4.1.0)                       
##    plyr                   1.8.6       2020-03-03 [1] CRAN (R 4.1.0)                       
##    png                    0.1-7       2013-12-03 [1] CRAN (R 4.1.0)                       
##    polyclip               1.10-0      2019-03-14 [1] CRAN (R 4.1.0)                       
##    polyCub                0.8.0       2021-01-27 [1] CRAN (R 4.1.0)                       
##    prettyunits            1.1.1       2020-01-24 [1] CRAN (R 4.1.0)                       
##    pROC                   1.17.0.1    2021-01-13 [1] CRAN (R 4.1.0)                       
##    processx               3.5.2       2021-04-30 [1] CRAN (R 4.1.0)                       
##    prodlim                2019.11.13  2019-11-17 [1] CRAN (R 4.1.0)                       
##    progress               1.2.2       2019-05-16 [1] CRAN (R 4.1.0)                       
##    progressr              0.7.0       2020-12-11 [1] CRAN (R 4.1.0)                       
##    projections          * 0.5.4       2021-04-22 [1] CRAN (R 4.1.0)                       
##    promises               1.2.0.1     2021-02-11 [1] CRAN (R 4.1.0)                       
##    proxy                  0.4-26      2021-06-07 [1] CRAN (R 4.1.0)                       
##    ps                     1.6.0       2021-02-28 [1] CRAN (R 4.1.0)                       
##    purrr                * 0.3.4       2020-04-17 [1] CRAN (R 4.1.0)                       
##    quadprog               1.5-8       2019-11-20 [1] CRAN (R 4.1.0)                       
##    Quandl                 2.10.0      2019-06-12 [1] CRAN (R 4.1.0)                       
##    quantmod             * 0.4.18      2020-12-09 [1] CRAN (R 4.1.0)                       
##    quantreg               5.85        2021-02-24 [1] CRAN (R 4.1.0)                       
##    R.methodsS3            1.8.1       2020-08-26 [1] CRAN (R 4.1.0)                       
##    R.oo                   1.24.0      2020-08-26 [1] CRAN (R 4.1.0)                       
##    R.utils                2.10.1      2020-08-26 [1] CRAN (R 4.1.0)                       
##    R6                     2.5.0       2020-10-28 [1] CRAN (R 4.1.0)                       
##    raster                 3.4-10      2021-05-03 [1] CRAN (R 4.1.0)                       
##    RColorBrewer         * 1.1-2       2014-12-07 [1] CRAN (R 4.1.0)                       
##    Rcpp                 * 1.0.6       2021-01-15 [1] CRAN (R 4.1.0)                       
##  D RcppParallel           5.1.4       2021-05-04 [1] CRAN (R 4.1.0)                       
##    readr                * 1.4.0       2020-10-05 [1] CRAN (R 4.1.0)                       
##    readxl               * 1.3.1       2019-03-13 [1] CRAN (R 4.1.0)                       
##    RecordLinkage        * 0.4-12.1    2020-08-25 [1] CRAN (R 4.1.0)                       
##    remotes                2.3.0       2021-04-01 [1] CRAN (R 4.1.0)                       
##    renv                   0.13.2      2021-03-30 [1] CRAN (R 4.1.0)                       
##    repr                   1.1.3       2021-01-21 [1] CRAN (R 4.1.0)                       
##    reprex                 2.0.0       2021-04-02 [1] CRAN (R 4.1.0)                       
##    reshape2               1.4.4       2020-04-09 [1] CRAN (R 4.1.0)                       
##    rgdal                  1.5-23      2021-02-03 [1] CRAN (R 4.1.0)                       
##    rio                  * 0.5.26      2021-03-01 [1] CRAN (R 4.1.0)                       
##  D rJava                  1.0-4       2021-04-29 [1] CRAN (R 4.1.0)                       
##    rlang                  0.4.11      2021-04-30 [1] CRAN (R 4.1.0)                       
##    rlist                  0.4.6.1     2016-04-04 [1] CRAN (R 4.1.0)                       
##    rmarkdown              2.9         2021-06-15 [1] CRAN (R 4.1.0)                       
##    rootSolve              1.8.2.1     2020-04-27 [1] CRAN (R 4.1.0)                       
##    rpart                  4.1-15      2019-04-12 [1] CRAN (R 4.1.0)                       
##    rprojroot              2.0.2       2020-11-15 [1] CRAN (R 4.1.0)                       
##    RSQLite              * 2.2.7       2021-04-22 [1] CRAN (R 4.1.0)                       
##    rstan                  2.21.2      2020-07-27 [1] CRAN (R 4.1.0)                       
##    rstatix              * 0.7.0       2021-02-13 [1] CRAN (R 4.1.0)                       
##    rstudioapi             0.13        2020-11-12 [1] CRAN (R 4.1.0)                       
##    runner                 0.4.0       2021-04-22 [1] CRAN (R 4.1.0)                       
##    rvcheck                0.1.8       2020-03-01 [1] CRAN (R 4.1.0)                       
##    rvest                  1.0.0       2021-03-09 [1] CRAN (R 4.1.0)                       
##    s2                     1.0.6       2021-06-17 [1] CRAN (R 4.1.0)                       
##    sass                   0.4.0       2021-05-12 [1] CRAN (R 4.1.0)                       
##    scales               * 1.1.1       2020-05-11 [1] CRAN (R 4.1.0)                       
##    scatterplot3d          0.3-41      2018-03-14 [1] CRAN (R 4.0.3)                       
##    see                  * 0.6.4       2021-05-29 [1] CRAN (R 4.1.0)                       
##    SemiCompRisks        * 3.4         2021-02-03 [1] CRAN (R 4.0.4)                       
##    sessioninfo            1.1.1       2018-11-05 [1] CRAN (R 4.1.0)                       
##    sf                   * 1.0-1       2021-06-29 [1] CRAN (R 4.1.0)                       
##    shiny                * 1.6.0       2021-01-25 [1] CRAN (R 4.1.0)                       
##    sitrep               * 0.1.7       2021-07-12 [1] Github (R4EPI/sitrep@9a57f33)        
##    skimr                * 2.1.3       2021-03-07 [1] CRAN (R 4.1.0)                       
##    slider               * 0.2.1       2021-03-23 [1] CRAN (R 4.0.5)                       
##    snakecase              0.11.0      2019-05-25 [1] CRAN (R 4.1.0)                       
##    sp                   * 1.4-5       2021-01-10 [1] CRAN (R 4.1.0)                       
##    SparseM                1.81        2021-02-18 [1] CRAN (R 4.1.0)                       
##    spatstat               2.1-0       2021-04-03 [1] CRAN (R 4.0.5)                       
##    spatstat.core          2.1-2       2021-04-18 [1] CRAN (R 4.1.0)                       
##    spatstat.data          2.1-0       2021-03-21 [1] CRAN (R 4.0.5)                       
##    spatstat.geom          2.1-0       2021-04-15 [1] CRAN (R 4.1.0)                       
##    spatstat.linnet        2.1-1       2021-03-28 [1] CRAN (R 4.0.5)                       
##    spatstat.sparse        2.0-0       2021-03-16 [1] CRAN (R 4.0.5)                       
##    spatstat.utils         2.1-0       2021-03-15 [1] CRAN (R 4.0.5)                       
##    spData               * 0.3.8       2020-07-03 [1] CRAN (R 4.0.5)                       
##    spdep                * 1.1-8       2021-05-23 [1] CRAN (R 4.1.0)                       
##    srvyr                * 1.0.1       2021-03-28 [1] CRAN (R 4.0.5)                       
##    StanHeaders            2.21.0-7    2020-12-17 [1] CRAN (R 4.0.3)                       
##    stars                * 0.5-2       2021-03-17 [1] CRAN (R 4.1.0)                       
##    statmod                1.4.36      2021-05-10 [1] CRAN (R 4.1.0)                       
##    stinepack              1.4         2018-07-30 [1] CRAN (R 4.0.3)                       
##    stringdist           * 0.9.6.3     2020-10-09 [1] CRAN (R 4.0.3)                       
##    stringi                1.6.2       2021-05-17 [1] CRAN (R 4.1.0)                       
##    stringr              * 1.4.0       2019-02-10 [1] CRAN (R 4.1.0)                       
##    survC1               * 1.0-3       2021-02-10 [1] CRAN (R 4.1.0)                       
##    surveillance         * 1.19.1      2021-03-31 [1] CRAN (R 4.1.0)                       
##    survey               * 4.0         2020-04-03 [1] CRAN (R 4.0.0)                       
##    survival             * 3.2-11      2021-04-26 [1] CRAN (R 4.1.0)                       
##    survminer            * 0.4.9       2021-03-09 [1] CRAN (R 4.0.5)                       
##    survMisc               0.5.5       2018-07-05 [1] CRAN (R 4.0.4)                       
##    svglite                2.0.0       2021-02-20 [1] CRAN (R 4.0.5)                       
##    systemfonts            1.0.2       2021-05-11 [1] CRAN (R 4.1.0)                       
##    tensor                 1.5         2012-05-05 [1] CRAN (R 4.1.0)                       
##    tibble               * 3.1.2       2021-05-16 [1] CRAN (R 4.1.0)                       
##    tidyquant            * 1.0.3       2021-03-05 [1] CRAN (R 4.1.0)                       
##    tidyr                * 1.1.3       2021-03-03 [1] CRAN (R 4.1.0)                       
##    tidyselect             1.1.1       2021-04-30 [1] CRAN (R 4.1.0)                       
##    tidytree               0.3.4       2021-05-22 [1] CRAN (R 4.1.0)                       
##    tidyverse            * 1.3.1       2021-04-15 [1] CRAN (R 4.1.0)                       
##    timeDate               3043.102    2018-02-21 [1] CRAN (R 4.1.0)                       
##    tmap                 * 3.3-1       2021-03-15 [1] CRAN (R 4.1.0)                       
##    tmaptools            * 3.1-1       2021-01-19 [1] CRAN (R 4.1.0)                       
##    treeio               * 1.16.1      2021-05-23 [1] Bioconductor                         
##    trending             * 0.0.3       2021-04-19 [1] CRAN (R 4.1.0)                       
##    truncnorm              1.0-8       2018-02-27 [1] CRAN (R 4.1.0)                       
##    tseries                0.10-48     2020-12-04 [1] CRAN (R 4.1.0)                       
##    tsibble              * 1.0.1       2021-04-12 [1] CRAN (R 4.1.0)                       
##    TTR                  * 0.24.2      2020-09-01 [1] CRAN (R 4.1.0)                       
##    tweenr                 1.0.2       2021-03-23 [1] CRAN (R 4.1.0)                       
##    tzdb                   0.1.1       2021-04-22 [1] CRAN (R 4.1.0)                       
##    units                * 0.7-2       2021-06-08 [1] CRAN (R 4.1.0)                       
##    UpSetR               * 1.4.0       2019-05-22 [1] CRAN (R 4.1.0)                       
##    urca                   1.3-0       2016-09-06 [1] CRAN (R 4.1.0)                       
##    utf8                   1.2.1       2021-03-12 [1] CRAN (R 4.1.0)                       
##    uuid                   0.1-4       2020-02-26 [1] CRAN (R 4.1.0)                       
##    V8                     3.4.2       2021-05-01 [1] CRAN (R 4.1.0)                       
##    vctrs                  0.3.8       2021-04-29 [1] CRAN (R 4.1.0)                       
##    viridis              * 0.6.1       2021-05-11 [1] CRAN (R 4.1.0)                       
##    viridisLite          * 0.4.0       2021-04-13 [1] CRAN (R 4.1.0)                       
##    visdat                 0.5.3       2019-02-15 [1] CRAN (R 4.1.0)                       
##    visNetwork           * 2.0.9       2019-12-06 [1] CRAN (R 4.1.0)                       
##    vistime              * 1.2.1       2021-04-10 [1] CRAN (R 4.1.0)                       
##    warp                   0.2.0       2020-10-21 [1] CRAN (R 4.1.0)                       
##    webshot              * 0.5.2       2019-11-22 [1] CRAN (R 4.1.0)                       
##    withr                  2.4.2       2021-04-18 [1] CRAN (R 4.1.0)                       
##    wk                     0.4.1       2021-03-16 [1] CRAN (R 4.1.0)                       
##    writexl              * 1.4.0       2021-04-20 [1] CRAN (R 4.1.0)                       
##    xfun                   0.24        2021-06-15 [1] CRAN (R 4.1.0)                       
##    XML                    3.99-0.6    2021-03-16 [1] CRAN (R 4.1.0)                       
##    xml2                   1.3.2       2020-04-23 [1] CRAN (R 4.1.0)                       
##    xtable               * 1.8-4       2019-04-21 [1] CRAN (R 4.1.0)                       
##    xts                  * 0.12.1      2020-09-09 [1] CRAN (R 4.1.0)                       
##    yaml                   2.2.1       2020-02-01 [1] CRAN (R 4.1.0)                       
##    yardstick            * 0.0.8       2021-03-28 [1] CRAN (R 4.1.0)                       
##    zip                    2.2.0       2021-05-31 [1] CRAN (R 4.1.0)                       
##    zoo                  * 1.8-9       2021-03-09 [1] CRAN (R 4.1.0)                       
## 
## [1] C:/Users/neale/OneDrive - Neale Batra/Documents/Analytics-LAPTOP-RS5P2IBO/R/Projects/R handbook/epiRhandbook_eng/renv/library/R-4.1/x86_64-w64-mingw32
## [2] C:/Users/neale/AppData/Local/Temp/Rtmpc1lSg5/renv-system-library
## 
##  D -- DLL MD5 mismatch, broken installation.

2 Download handbook and data

2.1 Download offline handbook

You can download the offline version of this handbook as an HTML file so that you can view the file in your web browser even if you no longer have internet access. If you plan to use the Epi R Handbook offline, here are a few things to consider:

  • When you open the file it may take a minute or two for the images and the Table of Contents to load
  • The offline handbook has a slightly different layout - one very long page with the Table of Contents on the left. To search for specific terms use Ctrl+F (Cmd+F on Mac)
  • See the Suggested packages page to assist you with installing appropriate R packages before you lose internet connectivity
  • Install our R package epirhandbook that contains all the example data (install process described below)

There are two ways you can download the handbook:

Use our R package

We offer an R package called epirhandbook. It includes a function download_book() that downloads the handbook file from our Github repository to your computer.

This package also contains a function get_data() that downloads all the example data to your computer.

Run the following code to install our R package epirhandbook from the Github repository appliedepi. This package is not on CRAN, so use the special function p_install_gh() to install it from Github.

# install the latest version of the Epi R Handbook package
pacman::p_install_gh("appliedepi/epirhandbook")

Now, load the package for use in your current R session:

# load the package for use
pacman::p_load(epirhandbook)

Next, run the package’s function download_book() (with empty parentheses) to download the handbook to your computer. Assuming you are in RStudio, a window will appear allowing you to select a save location.

# download the offline handbook to your computer
download_book()

2.2 Download data to follow along

To “follow along” with the handbook pages, you can download the example data and outputs.

Use our R package

The easiest approach to download all the data is to install our R package epirhandbook. It contains a function get_data() that saves all the example data to a folder of your choice on your computer.

To install our R package epirhandbook, run the following code. This package is not on CRAN, so use the function p_install_gh() to install it. The input is referencing our Github organisation (“appliedepi”) and the epirhandbook package.

# install the latest version of the Epi R Handbook package
pacman::p_install_gh("appliedepi/epirhandbook")

Now, load the package for use in your current R session:

# load the package for use
pacman::p_load(epirhandbook)

Next, use the package’s function get_data() to download the example data to your computer. Run get_data("all") to get all the example data, or provide a specific file name and extension within the quotes to retrieve only one file.

The data have already been downloaded with the package, and simply need to be transferred out to a folder on your computer. A pop-up window will appear, allowing you to select a save folder location. We suggest you create a new “data” folder as there are about 30 files (including example data and example outputs).

# download all the example data into a folder on your computer
get_data("all")

# download only the linelist example data into a folder on your computer
get_data(file = "linelist_cleaned.rds")

# the same call, with the file name passed by position
get_data("linelist_cleaned.rds")

Once you have used get_data() to save a file to your computer, you will still need to import it into R. See the Import and export page for details.
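As a minimal sketch of that import step, assuming the files were saved to a folder named “data” inside your working directory (the folder name and the object name linelist are assumptions, not something fixed by the package), you could build the file path and import the file with the rio package:

```r
# assumed save location: a folder called "data" in the working directory
data_folder   <- "data"
linelist_path <- file.path(data_folder, "linelist_cleaned.rds")

# import only if the file is where we expect and the rio package is installed
if (file.exists(linelist_path) && requireNamespace("rio", quietly = TRUE)) {
  linelist <- rio::import(linelist_path)
  dim(linelist)   # number of rows and columns
} else {
  message("Could not import ", linelist_path,
          " - check the folder and that rio is installed")
}
```

The check with file.exists() is only there so that a wrong folder gives a readable message rather than an error.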

If you wish, you can review all the data used in this handbook in the “data” folder of our Github repository.

Download one-by-one

This option involves downloading the data file-by-file from our Github repository via either a link or an R command specific to the file. Some file types allow a download button, while others can be downloaded via an R command.

Case linelist

This is a fictional Ebola outbreak, expanded by the handbook team from the ebola_sim practice dataset in the outbreaks package.

Other related files:

pacman::p_load(rio) # install/load the rio package

# import the file directly from Github
cleaning_dict <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/case_linelists/cleaning_dict.csv")

Malaria count data

These data are fictional counts of malaria cases by age group, facility, and day. A .rds file is an R-specific file type that preserves column classes. This ensures you will have only minimal cleaning to do after importing the data into R.

Click to download the malaria count data (.rds file)
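To illustrate what “preserves column classes” means, here is a small sketch using only base R and a made-up two-row dataset (not the malaria data). It saves the same data frame as .rds and as .csv, reads both back, and compares the column classes:

```r
# a small made-up dataset with a Date column and a factor column
counts <- data.frame(
  date      = as.Date(c("2020-01-01", "2020-01-02")),
  age_group = factor(c("0-4", "5-14")),
  cases     = c(3L, 7L)
)

# round trip through an .rds file
rds_file <- tempfile(fileext = ".rds")
saveRDS(counts, rds_file)
from_rds <- readRDS(rds_file)

# round trip through a .csv file, for comparison
csv_file <- tempfile(fileext = ".csv")
write.csv(counts, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

sapply(from_rds, class)   # Date and factor classes are unchanged
sapply(from_csv, class)   # the date column is no longer of class Date
```

After the .csv round trip you would need to convert the date column again, which is the extra cleaning that an .rds file avoids.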

Likert-scale data

These are fictional data from a Likert-style survey, used in the page on Demographic pyramids and Likert-scales. You can load these data directly into R by running the following commands:

pacman::p_load(rio) # install/load the rio package

# import the file directly from Github
likert_data <- import("https://raw.githubusercontent.com/appliedepi/epirhandbook_eng/master/data/likert_data.csv")

Flexdashboard

Below are links to the files associated with the page on Dashboards with R Markdown:

  • To download the R Markdown for the outbreak dashboard, right-click this link (Cmd+click for Mac) and select “Save link as”.
  • To download the HTML dashboard, right-click this link (Cmd+click for Mac) and select “Save link as”.

Contact Tracing

The Contact Tracing page demonstrates analysis of contact tracing data, using example data from Go.Data. The data used in the page can be downloaded as .rds files by clicking the following links:

Click to download the case investigation data (.rds file)

Click to download the contact registration data (.rds file)

Click to download the contact follow-up data (.rds file)

NOTE: Structured contact tracing data from other software (e.g. KoBo, DHIS2 Tracker, CommCare) may look different. If you would like to contribute alternative sample data or content for this page, please contact us.

TIP: If you are deploying Go.Data and want to connect to your instance’s API, see the Import and export page (API section) and the Go.Data Community of Practice.

GIS

Shapefiles have many sub-component files, each with a different file extension. One file will have the “.shp” extension, but others may have “.dbf”, “.prj”, etc.

The GIS basics page provides links to the Humanitarian Data Exchange website where you can download the shapefiles directly as zipped files.

For example, the health facility points data can be downloaded here. Download “hotosm_sierra_leone_health_facilities_points_shp.zip”. Once saved to your computer, “unzip” the folder. You will see several files with different extensions (e.g. “.shp”, “.prj”, “.shx”) - all these must be saved to the same folder on your computer. Then to import into R, provide the file path and name of the “.shp” file to st_read() from the sf package (as described in the GIS basics page).
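Because a shapefile only imports correctly when its component files sit together, a quick check before calling st_read() can save confusion. The sketch below is one possible approach: shapefile_is_complete() is a hypothetical helper written for this example (it is not part of sf), and the path in the usage lines is an assumption that you should replace with your own.

```r
# hypothetical helper: TRUE if the mandatory components (.shp, .shx, .dbf)
# exist side by side in the same folder
shapefile_is_complete <- function(shp_path, required = c("shp", "shx", "dbf")) {
  base <- tools::file_path_sans_ext(shp_path)   # path without the extension
  all(file.exists(paste0(base, ".", required)))
}

# assumed path to the unzipped ".shp" file - replace with your own
shp_path <- "data/gis/health_facilities.shp"

if (shapefile_is_complete(shp_path) && requireNamespace("sf", quietly = TRUE)) {
  facilities <- sf::st_read(shp_path)   # import as described in GIS basics
} else {
  message("Shapefile components missing, or the sf package is not installed")
}
```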

If you follow Option 1 to download all the example data (via our R package epirhandbook), all the shapefiles are included.

Alternatively, you can download the shapefiles from the R Handbook Github “data” folder (see the “gis” sub-folder). However, be aware that you will need to download each sub-file individually to your computer. In Github, click on each file and download it by clicking on the “Download” button. For example, the shapefile “sle_adm3” consists of many files, each of which would need to be downloaded from Github.

Phylogenetic trees

See the page on Phylogenetic trees. The data are a Newick file of a phylogenetic tree constructed from whole genome sequencing of 299 Shigella sonnei samples, and the corresponding sample data (converted to a text file). The Belgian samples and resulting data are kindly provided by the Belgian NRC for Salmonella and Shigella in the scope of a project conducted by an ECDC EUPHEM Fellow, and will also be published in a manuscript. The international data are openly available on public databases (NCBI) and have been previously published.

  • To download the “Shigella_tree.txt” phylogenetic tree file, right-click this link (Cmd+click for Mac) and select “Save link as”.
  • To download the “sample_data_Shigella_tree.csv” with additional information on each sample, right-click this link (Cmd+click for Mac) and select “Save link as”.
  • To see the newly created subset tree, right-click this link (Cmd+click for Mac) and select “Save link as”. The .txt file will download to your computer.

You can then import the .txt files with read.tree() from the ape package, as explained in the page.

ape::read.tree("Shigella_tree.txt")

Standardization

See the page on Standardised rates. You can load the data directly from our Github repository on the internet into your R session with the following commands:

# install/load the rio package
pacman::p_load(rio) 

##############
# Country A
##############
# import demographics for country A directly from Github
A_demo <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/country_demographics.csv")

# import deaths for country A directly from Github
A_deaths <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/deaths_countryA.csv")

##############
# Country B
##############
# import demographics for country B directly from Github
B_demo <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/country_demographics_2.csv")

# import deaths for country B directly from Github
B_deaths <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/deaths_countryB.csv")


###############
# Reference Pop
###############
# import the reference (standard) population directly from Github
standard_pop_data <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/world_standard_population_by_sex.csv")

Time series and outbreak detection

See the page on Time series and outbreak detection. We use Campylobacter cases reported in Germany 2002-2011, as available from the surveillance R package. (Note: this dataset has been adapted from the original, in that 3 months of data have been deleted from the end of 2011 for demonstration purposes.)

Click to download Campylobacter in Germany (.xlsx)

We also use climate data from Germany 2002-2011 (temperature in degrees Celsius and rainfall in millimetres). These were downloaded from the EU Copernicus satellite reanalysis dataset using the ecmwfr package. You will need to download all of these and import them with stars::read_stars() as explained in the time series page.

Click to download Germany weather 2002 (.nc file)

Click to download Germany weather 2003 (.nc file)

Click to download Germany weather 2004 (.nc file)

Click to download Germany weather 2005 (.nc file)

Click to download Germany weather 2006 (.nc file)

Click to download Germany weather 2007 (.nc file)

Click to download Germany weather 2008 (.nc file)

Click to download Germany weather 2009 (.nc file)

Click to download Germany weather 2010 (.nc file)

Click to download Germany weather 2011 (.nc file)
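Once the yearly files are downloaded, a short check that all of them are in place can be useful before importing. This is only a sketch: the folder name “data/weather” is an assumption, and it reads just the first file as a test rather than reproducing the full import shown in the time series page.

```r
# assumed folder holding the ten downloaded .nc files
weather_folder <- "data/weather"

# list every NetCDF file in that folder (empty if the folder does not exist)
nc_files <- list.files(weather_folder, pattern = "\\.nc$", full.names = TRUE)

if (length(nc_files) < 10) {
  message("Found ", length(nc_files),
          " .nc file(s); expected one per year, 2002-2011")
}

# read the first file as a check, if the stars package is installed
if (length(nc_files) > 0 && requireNamespace("stars", quietly = TRUE)) {
  weather_first <- stars::read_stars(nc_files[1])
  print(weather_first)
}
```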

Survey analysis

For the survey analysis page we use fictional mortality survey data based on MSF OCA survey templates. These fictional data were generated as part of the “R4Epis” project.

Click to download Fictional survey data (.xlsx)

Click to download Fictional survey data dictionary (.xlsx)

Click to download Fictional survey population data (.xlsx)

Shiny

The page on Dashboards with Shiny demonstrates the construction of a simple app to display malaria data.

To download the R files that produce the Shiny app:

You can click here to download the app.R file that contains both the UI and Server code for the Shiny app.

You can click here to download the facility_count_data.rds file that contains malaria data for the Shiny app. Note that you may need to store it within a “data” folder for the here() file paths to work correctly.

You can click here to download the global.R file that should run prior to the app opening, as explained in the page.

You can click here to download the plot_epicurve.R file that is sourced by global.R. Note that you may need to store it within a “funcs” folder for the here() file paths to work correctly.

Basics

3 R Basics

Welcome!

This page reviews the essentials of R. It is not intended to be a comprehensive tutorial, but it provides the basics and can be useful for refreshing your memory. The section on Resources for learning contains links to more comprehensive tutorials.

Parts of this page have been adapted with permission from the R4Epis project.

See the page on Transition to R for tips on switching to R from STATA, SAS, or Excel.

3.1 Why use R?

As stated on the R project website, R is a programming language and environment for statistical computing and graphics. It is highly versatile, extendable, and community-driven.

Cost

R is free to use! There is a strong ethic in the community of free and open-source material.

Reproducibility

Conducting your data management and analysis through a programming language (compared to Excel or another primarily point-click/manual tool) enhances reproducibility, makes error-detection easier, and eases your workload.

Community

The R community of users is enormous and collaborative. New packages and tools to address real-life problems are developed daily, and vetted by the community of users. As one example, R-Ladies is a worldwide organization whose mission is to promote gender diversity in the R community, and is one of the largest organizations of R users. It likely has a chapter near you!

3.2 Key terms

RStudio - RStudio is a Graphical User Interface (GUI) for easier use of R. Read more in the RStudio section.

Objects - Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - is an object, which is assigned a name and can be referenced in later commands. Read more in the Objects section.

Functions - A function is a code operation that accepts inputs and returns a transformed output. Read more in the Functions section.

Packages - An R package is a shareable bundle of functions. Read more in the Packages section.

Scripts - A script is the document file that holds your commands. Read more in the Scripts section.
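To make these terms concrete, here is a very small sketch in base R. All names and values are made up for illustration.

```r
# objects: values stored under a name with the assignment operator <-
village_names    <- c("Village A", "Village B", "Village C")  # made-up names
total_population <- 12500                                      # made-up number

# function: mean() accepts an input (a numeric vector) and returns an output
ages     <- c(23, 35, 41)
mean_age <- mean(ages)
mean_age   # returns 33

# package: a bundle of functions; stats ships with R, so there is nothing to install
library(stats)
median(ages)
```

Saving these lines in a file ending in .R would make them a script.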

3.3 Resources for learning

Resources within RStudio

Help documentation

Search the RStudio “Help” tab for documentation on R packages and specific functions. This is within the pane that also contains Files, Plots, and Packages (typically in the lower-right pane). As a shortcut, you can also type the name of a package or function into the R console after a question-mark to open the relevant Help page. Do not include parentheses.

For example: ?filter or ?diagrammeR.
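The same Help system can also be reached through functions, which is handy when you are not sure of the exact name to look up. The sketch below uses only base R; the object name mean_help is arbitrary, and the lines that display Help pages are wrapped in interactive() so that they only run in a live session.

```r
# look up a Help topic and store the result; printing it opens the page
mean_help <- help("mean")      # the same topic as ?mean
class(mean_help)

# list functions whose names contain "mean", to find what to look up
apropos("mean")

if (interactive()) {
  print(mean_help)             # display the Help page for mean()
  help(package = "stats")      # index of Help pages for a whole package
}
```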

Interactive tutorials

There are several ways to learn R interactively within RStudio.

RStudio itself offers a Tutorial pane that is powered by the learnr R package. Simply install this package and open a tutorial via the new “Tutorial” tab in the upper-right RStudio pane (which also contains Environment and History tabs).

The R package swirl offers interactive courses in the R Console. Install and load this package, then run the command swirl() (empty parentheses) in the R console. You will see prompts appear in the Console. Respond by typing in the Console. It will guide you through a course of your choice.

Cheatsheets

There are many PDF “cheatsheets” available on the RStudio website, for example:

  • Factors with forcats package
  • Dates and times with lubridate package
  • Strings with stringr package
  • Iterative operations with purrr package
  • Data import
  • Data transformation cheatsheet with dplyr package
  • R Markdown (to create documents like PDF, Word, Powerpoint…)
  • Shiny (to build interactive web apps)
  • Data visualization with ggplot2 package
  • Cartography (GIS)
  • leaflet package (interactive maps)
  • Python with R (reticulate package)

This is an online R resource specifically for Excel users

Twitter

R has a vibrant Twitter community where you can learn tips, shortcuts, and news - follow these accounts:

Also:

#epitwitter and #rstats

Free online resources

A definitive text is the R for Data Science book by Garrett Grolemund and Hadley Wickham.

The R4Epis project website aims to “develop standardised data cleaning, analysis and reporting tools to cover common types of outbreaks and population-based surveys that would be conducted in an MSF emergency response setting.” You can find R basics training materials, templates for RMarkdown reports on outbreaks and surveys, and tutorials to help you set them up.

3.4 Installation

R and RStudio

How to install R

Visit this website https://www.r-project.org/ and download the latest version of R suitable for your computer.

How to install RStudio

Visit this website https://rstudio.com/products/rstudio/download/ and download the latest free Desktop version of RStudio suitable for your computer.

Permissions

Note that you should install R and RStudio to a drive where you have read and write permissions. Otherwise, your ability to install R packages (a frequent occurrence) will be impacted. If you encounter problems, try opening RStudio by right-clicking the icon and selecting “Run as administrator”. Other tips can be found in the page R on network drives.
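If R is already installed, one rough way to check whether you can write to the folder where packages are installed is sketched below, using base R only. Treat the result as a hint rather than a guarantee: file.access() can be unreliable on some Windows and network file systems.

```r
# the folder(s) where R installs packages; the first is the default target
lib_path <- .libPaths()[1]
lib_path

# file.access() with mode = 2 tests write permission: 0 means yes, -1 means no
can_write <- file.access(lib_path, mode = 2) == 0

if (can_write) {
  message("You can write to ", lib_path)
} else {
  message("No write permission for ", lib_path, " - package installs may fail")
}
```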

How to update R and RStudio

Your version of R is printed to the R Console at start-up. You can also run sessionInfo().
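A few base R commands give the same information in smaller pieces. This sketch assumes nothing beyond a working R installation; the object name session is arbitrary.

```r
R.version.string            # a text description of the running R version
getRversion()               # the version as an object you can compare
getRversion() >= "4.0.0"    # TRUE if you are running R 4.0.0 or newer
packageVersion("stats")     # the version of one installed package

session <- sessionInfo()    # full details, stored for later inspection
session$R.version$version.string
```

Comparing getRversion() against a version string is a convenient way to check whether your installation meets the minimum version that a package requires.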

To update R, go to the website mentioned above and re-install R. Alternatively, you can use the installr package (on Windows) by running installr::updateR(). This will open dialog boxes to help you download the latest R version and update your packages to the new R version. More details can be found in the installr documentation.

Be aware that the old R version will still exist in your computer. You can temporarily run an older version (older “installation”) of R by clicking “Tools” -> “Global Options” in RStudio and choosing an R version. This can be useful if you want to use a package that has not been updated to work on the newest version of R.

To update RStudio, you can go to the website above and re-download RStudio. Another option is to click “Help” -> “Check for Updates” within RStudio, but this may not show the very latest updates.

To see which versions of R, RStudio, or packages were used when this Handbook was made, see the page on Editorial and technical notes.

Other software you may need to install

  • TinyTeX (for compiling an RMarkdown document to PDF)
  • Pandoc (for compiling RMarkdown documents)
  • RTools (for building packages for R)
  • phantomjs (for saving still images of animated networks, such as transmission chains)

TinyTeX

TinyTeX is a custom LaTeX distribution, useful when trying to produce PDFs from R.
See https://yihui.org/tinytex/ for more information.

To install TinyTeX from R:

install.packages('tinytex')
tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex()

Pandoc

Pandoc is a document converter, a separate piece of software from R. It comes bundled with RStudio and should not need to be downloaded. It helps the process of converting R Markdown documents to formats like .pdf and adding complex functionality.

RTools

RTools is a collection of software for building packages for R.

Install from this website: https://cran.r-project.org/bin/windows/Rtools/

phantomjs

This is often used to take “screenshots” of webpages. For example when you make a transmission chain with epicontacts package, an HTML file is produced that is interactive and dynamic. If you want a static image, it can be useful to use the webshot package to automate this process. This will require the external program “phantomjs”. You can install phantomjs via the webshot package with the command webshot::install_phantomjs().

3.5 RStudio

RStudio orientation

First, open RStudio. As their icons can look very similar, be sure you are opening RStudio and not R.

For RStudio to work you must also have R installed on the computer (see above for installation instructions).

RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward! You can see the complete RStudio user-interface cheatsheet (PDF) here

By default, RStudio displays four rectangular panes.

TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.

The Source Pane
This pane, by default in the upper-left, is a space to edit, run, and save your scripts. Scripts contain the commands you want to run. This pane can also display datasets (data frames) for viewing.

For Stata users, this pane is similar to your Do-file and Data Editor windows.

The R Console Pane

The R Console, by default the left or lower-left pane in RStudio, is the home of the R “engine”. This is where the commands are actually run and where non-graphic outputs and error/warning messages appear. You can directly enter and run commands in the R Console, but be aware that these commands are not saved as they are when running commands from a script.

If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.

The Environment Pane
This pane, by default in the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). You can click on the arrow next to a data frame name to see its variables.

In Stata, this is most similar to the Variables Manager window.

This pane also contains History, where you can see commands that you ran previously. It also has a “Tutorial” tab where you can complete interactive R tutorials if you have the learnr package installed. It also has a “Connections” pane for external connections, and can have a “Git” pane if you choose to interface with Github.

Plots, Viewer, Packages, and Help Pane
The lower-right pane includes several important tabs. Typical plot graphics including maps will display in the Plot pane. Interactive or HTML outputs will display in the Viewer pane. The Help pane can display documentation and help files. The Files pane is a browser which can be used to open or delete files. The Packages pane allows you to see, install, update, delete, load/unload R packages, and see which version of the package you have. To learn more about packages see the packages section below.

This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.

RStudio settings

Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options. There you can change the default settings, including appearance/background color.

Restart

If your R freezes, you can re-start R by going to the Session menu and clicking “Restart R”. This avoids the hassle of closing and opening RStudio. Everything in your R environment will be removed when you do this.

Keyboard shortcuts

Some very useful keyboard shortcuts are below. See all the keyboard shortcuts for Windows, Mac, and Linux on the second page of this RStudio user interface cheatsheet.

Windows/Linux | Mac | Action
Esc | Esc | Interrupt current command (useful if you accidentally ran an incomplete command and cannot escape seeing “+” in the R console)
Ctrl + s | Cmd + s | Save (script)
Tab | Tab | Auto-complete
Ctrl + Enter | Cmd + Enter | Run current line(s)/selection of code
Ctrl + Shift + c | Cmd + Shift + c | Comment/uncomment the highlighted lines
Alt + - | Option + - | Insert <-
Ctrl + Shift + m | Cmd + Shift + m | Insert %>%
Ctrl + l | Ctrl + l | Clear the R console
Ctrl + Alt + b | Cmd + Option + b | Run from start to current line
Ctrl + Alt + t | Cmd + Option + t | Run the current code section (R Markdown)
Ctrl + Alt + i | Cmd + Option + i | Insert code chunk (into R Markdown)
Ctrl + Alt + c | Cmd + Option + c | Run current code chunk (R Markdown)
Up/down arrows in R console | Same | Toggle through recently run commands
Shift + up/down arrows in script | Same | Select multiple code lines
Ctrl + f | Cmd + f | Find and replace in current script
Ctrl + Shift + f | Cmd + Shift + f | Find in files (search/replace across many scripts)
Alt + l | Cmd + Option + l | Fold selected code
Shift + Alt + l | Cmd + Shift + Option + l | Unfold selected code

TIP: Use your Tab key when typing to engage RStudio’s auto-complete functionality. This can prevent spelling errors. Press Tab while typing to produce a drop-down menu of likely functions and objects, based on what you have typed so far.

3.6 Functions

Functions are at the core of using R. Functions are how you perform tasks and operations. Many functions come installed with R, many more are available for download in packages (explained in the packages section), and you can even write your own custom functions!

This basics section on functions explains:

  • What a function is and how it works
  • What function arguments are
  • How to get help understanding a function

A quick note on syntax: In this handbook, functions are written in code-text with open parentheses, like this: filter(). As explained in the packages section, functions are downloaded within packages. In this handbook, package names are written in bold, like dplyr. Sometimes in example code you may see the function name linked explicitly to the name of its package with two colons (::) like this: dplyr::filter(). The purpose of this linkage is explained in the packages section.

Simple functions

A function is like a machine that receives inputs, does some action with those inputs, and produces an output. What the output is depends on the function.

Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:

sqrt(49)
## [1] 7

The object provided to a function also can be a column in a dataset (see the Objects section for detail on all the kinds of objects). Because R can store multiple datasets, you will need to specify both the dataset and the column. One way to do this is using the $ notation to link the name of the dataset and the name of the column (dataset$column). In the example below, the function summary() is applied to the numeric column age in the dataset linelist, and the output is a summary of the column’s numeric and missing values.

# Print summary statistics of column 'age' in the dataset 'linelist'
summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   16.07   23.00   84.00      86
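
To try the dataset$column pattern without the handbook's linelist, here is a minimal sketch using a small made-up data frame. The name demo_linelist and its values are hypothetical and chosen only for illustration.

```r
# A tiny made-up data frame standing in for a linelist (hypothetical values)
demo_linelist <- data.frame(
  case_id = c("A1", "A2", "A3", "A4", "A5"),
  age     = c(4, 12, 35, NA, 61)
)

# Access the 'age' column with the $ notation and summarise it
summary(demo_linelist$age)

# Many functions return NA if any value is missing, unless na.rm = TRUE is set
mean(demo_linelist$age)
## [1] NA
mean(demo_linelist$age, na.rm = TRUE)
## [1] 28
```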

NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.

Functions with multiple arguments

Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.

  • Some arguments are required for the function to work correctly, others are optional
  • Optional arguments have default settings
  • Arguments can take character, numeric, logical (TRUE/FALSE), and other inputs

Here is a fun fictional function, called oven_bake(), as an example of a typical function. It takes an input object (e.g. a dataset, or in this example “dough”) and performs operations on it as specified by additional arguments (minutes = and temperature =). The output can be printed to the console, or saved as an object using the assignment operator <-.
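
Since oven_bake() is fictional, the definition below is purely a hypothetical sketch of how such a function could be written. The default values (35 minutes, 180 degrees) are assumptions made up for this example; the point is only to show a required argument, optional arguments with defaults, and saving the output with <-.

```r
# Hypothetical definition of oven_bake(), for illustration only
oven_bake <- function(dough, minutes = 35, temperature = 180) {
  # 'dough' has no default, so it is required;
  # 'minutes' and 'temperature' fall back to their defaults if omitted
  paste0("Baked ", dough, " for ", minutes, " minutes at ", temperature, " degrees")
}

# Supply only the required argument; the defaults are used for the rest
oven_bake("sourdough")
## [1] "Baked sourdough for 35 minutes at 180 degrees"

# Override the defaults and save the output as an object with <-
my_bread <- oven_bake(dough = "rye", minutes = 50, temperature = 200)
my_bread
## [1] "Baked rye for 50 minutes at 200 degrees"
```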

In a more realistic example, the age_pyramid() command below produces an age pyramid plot based on defined age groups and a binary split column, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the data frame to use, age_cat5 as the column to count, and gender as the binary column to use for splitting the pyramid by color.

# Create an age pyramid
age_pyramid(data = linelist, age_group = "age_cat5", split_by = "gender")

The above command can be equivalently written as below, in a longer style with a new line for each argument. This style can be easier to read, and easier to write “comments” with # to explain each part (commenting extensively is good practice!). To run this longer command you can highlight the entire command and click “Run”, or just place your cursor in the first line and then press the Ctrl and Enter keys simultaneously.

# Create an age pyramid
age_pyramid(
  data = linelist,        # use case linelist
  age_group = "age_cat5", # provide age group column
  split_by = "gender"     # use gender column for two sides of pyramid
  )

The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The below code produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.

# This command will produce the exact same graphic as above
age_pyramid(linelist, "age_cat5", "gender")

A more complex age_pyramid() command might include the optional arguments to:

  • Show proportions instead of counts (set proportional = TRUE when the default is FALSE)
  • Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)

NOTE: For arguments that you specify with both parts of the argument (e.g. proportional = TRUE), their order among all the arguments does not matter.

age_pyramid(
  linelist,                    # use case linelist
  "age_cat5",                  # age group column
  "gender",                    # split by gender
  proportional = TRUE,         # percents instead of counts
  pal = c("orange", "purple")  # colors
  )

Writing Functions

R is a language that is oriented around functions, so you should feel empowered to write your own functions. Creating functions brings several advantages:

  • To facilitate modular programming - the separation of code into independent and manageable pieces
  • Replace repetitive copy-and-paste, which can be error prone
  • Give pieces of code memorable names

How to write a function is covered in-depth in the Writing functions page.
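
As a small taste, here is a minimal sketch of a custom function that replaces a calculation you might otherwise copy and paste. The function name case_fatality_pct() and the numbers passed to it are hypothetical.

```r
# Hypothetical helper: percent of cases that died, rounded
case_fatality_pct <- function(deaths, cases, digits = 1) {
  if (cases <= 0) {
    stop("`cases` must be greater than zero")
  }
  round(100 * deaths / cases, digits = digits)
}

# Re-use the same logic with different inputs
case_fatality_pct(deaths = 12, cases = 240)
## [1] 5
case_fatality_pct(deaths = 7, cases = 90, digits = 2)
## [1] 7.78
```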

3.7 Packages

Packages contain functions.

An R package is a shareable bundle of code and documentation that contains pre-defined functions. Users in the R community develop packages all the time, catered to specific problems, so it is likely that one can help with your work! You will install and use hundreds of packages in your use of R.

On installation, R contains “base” packages and functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use. In this handbook, package names are written in bold. One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.

Install and load

Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, it is stored in your “library”. You can then access the functions it contains during your current R session by “loading” the package.

Think of R as your personal library: When you download a package, your library gains a new book of functions, but each time you want to use a function in that book, you must borrow (“load”) that book from your library.

In summary: to use the functions available in an R package, 2 steps must be implemented:

  1. The package must be installed (once), and
  2. The package must be loaded (each R session)

Your library

Your “library” is actually a folder on your computer, containing a folder for each package that has been installed. Find out where R is installed in your computer, and look for a folder called “win-library”. For example: R\win-library\4.0 (the 4.0 is the R version - you’ll have a different library for each R version you’ve downloaded).

You can print the file path to your library by entering .libPaths() (empty parentheses). This becomes especially important if working with R on network drives.
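
For example, the following sketch uses base R functions to look inside your library. The stats package is used only because it is installed with R itself; the folders printed will differ on each computer.

```r
# Folder(s) where R looks for installed packages
lib_paths <- .libPaths()
lib_paths

# Each installed package is a sub-folder of a library folder
head(list.files(lib_paths[1]))

# Folder where one specific installed package lives (stats ships with R)
find.package("stats")
```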

Install from CRAN

Most often, R users download packages from CRAN. CRAN (Comprehensive R Archive Network) is an online public warehouse of R packages that have been published by R community members.

Are you worried about viruses and security when downloading a package from CRAN? Read this article on the topic.

How to install and load

In this handbook, we suggest using the pacman package (short for “package manager”). It offers a convenient function p_load() which will install a package if necessary and load it for use in the current R session.

The syntax is quite simple. Just list the names of the packages within the p_load() parentheses, separated by commas. This command will install the rio, tidyverse, and here packages if they are not yet installed, and will load them for use. This makes the p_load() approach convenient and concise if sharing scripts with others. Note that package names are case-sensitive.

# Install (if necessary) and load packages for use
pacman::p_load(rio, tidyverse, here)

Note that we have used the syntax pacman::p_load() which explicitly writes the package name (pacman) prior to the function name (p_load()), connected by two colons ::. This syntax is useful because it lets you call p_load() without first loading pacman with library() (assuming pacman is already installed).

There are alternative base R functions that you will see often. The base R function for installing a package is install.packages(). The name of the package to install must be provided in the parentheses in quotes. If you want to install multiple packages in one command, they must be listed within a character vector c().

Note: this command installs a package, but does not load it for use in the current session.

# install a single package with base R
install.packages("tidyverse")

# install multiple packages with base R
install.packages(c("tidyverse", "rio", "here"))

Installation can also be accomplished point-and-click by going to the RStudio “Packages” pane and clicking “Install” and searching for the desired package name.

The base R function to load a package for use (after it has been installed) is library(). It can load only one package at a time (another reason to use p_load()). You can provide the package name with or without quotes.

# load packages for use, with base R
library(tidyverse)
library(rio)
library(here)
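
If you cannot use pacman, the "install if necessary, then load" behaviour can be approximated with base R. The sketch below is not how p_load() is implemented; missing_packages() is a hypothetical helper written for this example, and base packages are used so that nothing is actually downloaded when you run it.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("stats", "utils")   # base packages, so nothing should be missing
to_install <- missing_packages(needed)

# Install only what is missing
if (length(to_install) > 0) {
  install.packages(to_install)
}

# Load each package; character.only = TRUE lets library() accept a text value
for (pkg in needed) {
  library(pkg, character.only = TRUE)
}
```

In your own script you would replace the contents of needed with the packages you use, for example c("rio", "here").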

To check whether a package is installed and/or loaded, you can view the Packages pane in RStudio. If the package is installed, it is shown there with version number. If its box is checked, it is loaded for the current session.

Install from Github

Sometimes, you need to install a package that is not yet available from CRAN. Or perhaps the package is available on CRAN but you want the development version with new features not yet offered in the more stable published CRAN version. These are often hosted on the website github.com in a free, public-facing code “repository”. Read more about Github in the handbook page on Version control and collaboration with Git and Github.

To download R packages from Github, you can use the function p_load_gh() from pacman, which will install the package if necessary, and load it for use in your current R session. Alternatives to install include using the remotes or devtools packages. Read more about all the pacman functions in the package documentation.

To install from Github, you have to provide more information. You must provide:

  1. The Github ID of the repository owner
  2. The name of the repository that contains the package
  3. (optional) The name of the “branch” (specific development version) you want to download

In the examples below, the first word in the quotation marks is the Github ID of the repository owner, after the slash is the name of the repository (the name of the package).

# install/load the epicontacts package from its Github repository
p_load_gh("reconhub/epicontacts")

If you want to install from a “branch” (version) other than the main branch, add the branch name after an “@”, after the repository name.

# install the "timeline" branch of the epicontacts package from Github
p_load_gh("reconhub/epicontacts@timeline")

If there is no difference between the Github version and the version on your computer, no action will be taken. You can “force” a re-install by instead using p_load_current_gh() with the argument update = TRUE. Read more about pacman in this online vignette

Install from ZIP or TAR

You could install the package from a URL:

packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")

Or, download it to your computer in a zipped file:

Option 1: using install_local() from the remotes package

remotes::install_local("~/Downloads/dplyr-master.zip")

Option 2: using install.packages() from base R, providing the file path to the ZIP file and setting type = "source" and repos = NULL.

install.packages("~/Downloads/dplyr-master.zip", repos=NULL, type="source")

Code syntax

For clarity in this handbook, functions are sometimes preceded by the name of their package using the :: symbol in the following way: package_name::function_name()

Once a package is loaded for a session, this explicit style is not necessary. One can just use function_name(). However, writing the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()). Writing the package name also lets you use the function without first loading the package, as long as the package is installed.

# This command uses the package "rio" and its function "import()" to import a dataset
linelist <- rio::import("linelist.xlsx", which = "Sheet1")

Function help

To read more about a function, you can search for it in the Help tab of the lower-right RStudio pane. You can also run a command like ?thefunctionname (put the name of the function after a question mark) and the Help page will appear in the Help pane. Finally, try searching online for resources.
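
A few base R commands for getting help are sketched below, using sd() and mean() only as convenient examples of functions that are always available.

```r
# Open the help page for a function (two equivalent forms)
?sd
help("sd")

# Print a function's arguments and their defaults in the console
args(sd)

# Run the examples from the bottom of a function's help page
example(mean)
```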

Update packages

You can update packages by re-installing them. You can also click the green “Update” button in your RStudio Packages pane to see which packages have new versions to install. Be aware that your old code may need to be updated if there is a major revision to how a function works!

Delete packages

Use p_delete() from pacman, or remove.packages() from base R. Alternatively, go find the folder which contains your library and manually delete the folder.

Dependencies

Packages often depend on other packages to work. These are called dependencies. If a dependency fails to install, then the package depending on it may also fail to install.

See the dependencies of a package with p_depends(), and see which packages depend on it with p_depends_reverse()

Masked functions

It is not uncommon that two or more packages contain the same function name. For example, the package dplyr has a filter() function, but so does the package stats. The default filter() function depends on the order these packages are first loaded in the R session - the later one will be the default for the command filter().

You can check the order in the Environment pane of RStudio - click the drop-down for “Global Environment” and see the order of the packages. This list follows R's search path, so functions from packages nearer the top of the list (the most recently loaded) mask functions of the same name in packages further down. When first loading a package, R will warn you in the console if masking is occurring, but this can be easy to miss.

Here are ways you can fix masking:

  1. Specify the package name in the command. For example, use dplyr::filter()
  2. Re-arrange the order in which the packages are loaded (e.g. within p_load()), and start a new R session
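
You can also inspect masking from the console. The sketch below uses base R tools only; what it prints depends on which packages are loaded in your session (in a fresh session without dplyr, only the stats version of filter() is found).

```r
# Which attached packages provide a function called filter()?
find("filter")

# Attached packages, in the order R searches them (earlier entries win)
search()

# Names that are defined in more than one place on the search path
conflicts()

# With an explicit package name you always get that package's version
stats::filter(1:5, rep(1/3, 3))
```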

Detach / unload

To detach (unload) a package, use this command, with the correct package name and only one colon. Note that this may not resolve masking.

detach(package:PACKAGE_NAME_HERE, unload=TRUE)

Install older version

See this guide to install an older version of a particular package.

Suggested packages

See the page on Suggested packages for a listing of packages we recommend for everyday epidemiology.

3.8 Scripts

Scripts are a fundamental part of programming. They are documents that hold your commands (e.g. functions to create and modify datasets, print visualizations, etc). You can save a script and run it again later. There are many advantages to storing and running your commands from a script (vs. typing commands one-by-one into the R console “command line”):

  • Portability - you can share your work with others by sending them your scripts
  • Reproducibility - so that you and others know exactly what you did
  • Version control - so you can track changes made by yourself or colleagues
  • Commenting/annotation - to explain to your colleagues what you have done

Commenting

In a script you can also annotate (“comment”) around your R code. Commenting is helpful to explain to yourself and other readers what you are doing. You can add a comment by typing the hash symbol (#) and writing your comment after it. The commented text will appear in a different color than the R code.

Any code written after the # will not be run. Therefore, placing a # before code is also a useful way to temporarily block a line of code (“comment out”) if you do not want to delete it. You can comment out/in multiple lines at once by highlighting them and pressing Ctrl+Shift+c (Cmd+Shift+c in Mac).

# A comment can be on a line by itself
# import data
linelist <- import("linelist_raw.xlsx") %>%   # a comment can also come after code
# filter(age > 50)                          # It can also be used to deactivate / remove a line of code
  count()

  • Comment on what you are doing and on why you are doing it.
  • Break your code into logical sections
  • Accompany your code with a text step-by-step description of what you are doing (e.g. numbered steps)

Style

It is important to be conscious of your coding style - especially if working on a team. We advocate for the tidyverse style guide. There are also packages such as styler and lintr which help you conform to this style.

A few very basic points to make your code readable to others:

  • When naming objects, use only lowercase letters, numbers, and underscores _, e.g. my_data
  • Use frequent spaces, including around operators, e.g. n = 1 and age_new <- age_old + 3

Example Script

Below is an example of a short R script. Remember, the better you succinctly explain your code in comments, the more your colleagues will like you!

R markdown

An R markdown script is a type of R script in which the script itself becomes an output document (PDF, Word, HTML, PowerPoint, etc.). These are incredibly useful and versatile tools often used to create dynamic and automated reports. Even this website and handbook are produced with R markdown scripts!

It is worth noting that beginner R users can also use R Markdown - do not be intimidated! To learn more, see the handbook page on Reports with R Markdown documents.

R notebooks

There is no difference between writing in an R markdown document and writing in an R notebook. However, the execution of the document differs slightly. See this site for more details.

Shiny

Shiny apps/websites are contained within one script, which must be named app.R. This file has three components:

  1. A user interface (ui)
  2. A server function
  3. A call to the shinyApp function

See the handbook page on Dashboards with Shiny, or this online tutorial: Shiny tutorial

In the past, the above file was split into two files (ui.R and server.R).

Code folding

You can collapse portions of code to make your script easier to read.

To do this, create a text header with #, write your header, and follow it with at least 4 of either dashes (-), hashes (#) or equals (=). When you have done this, a small arrow will appear in the “gutter” to the left (by the row number). You can click this arrow and the code below until the next header will collapse and a dual-arrow icon will appear in its place.

To expand the code, either click the arrow in the gutter again, or the dual-arrow icon. There are also keyboard shortcuts as explained in the RStudio section of this page.

By creating headers with #, you will also activate the Table of Contents at the bottom of your script (see below) that you can use to navigate your script. You can create sub-headers by adding more # symbols, for example # for primary, ## for secondary, and ### for tertiary headers.

Below are two versions of an example script. On the left is the original with commented headers. On the right, four dashes have been written after each header, making them collapsible. Two of them have been collapsed, and you can see that the Table of Contents at the bottom now shows each section.

Other areas of code that are automatically eligible for folding include “braced” regions with brackets { } such as function definitions or conditional blocks (if else statements). You can read more about code folding at the RStudio site.
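
The header convention described in this section can be sketched as follows. The section names and values are made up; what matters is that each header line starts with # and ends with at least four dashes, hashes, or equals signs, which makes the section foldable in RStudio.

```r
# Load packages ----
library(stats)

# Import data ----
ages <- c(4, 12, 35, 61)   # made-up values standing in for imported data

## Clean data ----
ages_clean <- ages[ages < 60]

# Analysis ####
mean(ages_clean)
## [1] 17

# A braced region such as a function definition can also be folded
describe_ages <- function(x) {
  c(n = length(x), mean = mean(x))
}
describe_ages(ages_clean)
```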

3.9 Working directory

The working directory is the root folder location used by R for your work. By default, R saves new files and outputs to this location, and looks for files to import (e.g. datasets) here as well.

The working directory appears in grey text at the top of the RStudio Console pane. You can also print the current working directory by running getwd() (leave the parentheses empty).
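
A quick way to explore the working directory from the console is sketched below. The file path "data/linelist.csv" is a hypothetical example and will likely return FALSE on your computer.

```r
# Print the current working directory
getwd()

# List the files and folders in the working directory
list.files()

# Check whether a file can be found relative to the working directory
# ("data/linelist.csv" is a hypothetical path)
file.exists("data/linelist.csv")
```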

Set by command

Until recently, many people learning R were taught to begin their scripts with a setwd() command. Please instead consider using an R project-oriented workflow and read the reasons for not using setwd(). In brief, your work becomes specific to your computer, file paths used to import and export files become “brittle”, and this severely hinders collaboration and use of your code on any other computer. There are easy alternatives!

As noted above, although we do not recommend this approach in most circumstances, you can use the command setwd() with the desired folder file path in quotations, for example:

setwd("C:/Documents/R Files/My analysis")

DANGER: Setting a working directory with setwd() can be “brittle” if the file path is specific to one computer. Instead, use file paths relative to an R Project root directory (with the here package).

Set manually

To set the working directory manually (the point-and-click equivalent of setwd()), click the Session drop-down menu and go to “Set Working Directory” and then “Choose Directory”. This will set the working directory for that specific R session. Note: if using this approach, you will have to do this manually each time you open RStudio.

Within an R project

If using an R project, the working directory will default to the R project root folder that contains the “.rproj” file. This will apply if you open RStudio by clicking open the R Project (the file with “.rproj” extension).

Working directory in an R markdown

In an R markdown script, the default working directory is the folder the Rmarkdown file (.Rmd) is saved within. If using an R project and here package, this does not apply and the working directory will be here() as explained in the R projects page.

If you want to change the working directory of a stand-alone R markdown (not in an R project), be aware that setwd() will only apply to the specific code chunk in which it is run. To make the change for all code chunks in an R markdown, edit the setup chunk to add the root.dir = parameter, such as below:

knitr::opts_knit$set(root.dir = 'desired/directorypath')

It is much easier to just use the R markdown within an R project and use the here package.

Providing file paths

Perhaps the most common source of frustration for an R beginner (at least on a Windows machine) is typing in a file path to import or export data. There is a thorough explanation of how to best input file paths in the Import and export page, but here are a few key points:

Broken paths

Below is an example of an “absolute” or “full address” file path. These will likely break if used by another computer. One exception is if you are using a shared/network drive.

C:/Users/Name/Document/Analytic Software/R/Projects/Analysis2019/data/March2019.csv  

Slash direction

If typing in a file path, be aware of the direction of the slashes. Use forward slashes (/) to separate the components (“data/provincial.csv”). For Windows users, the default way that file paths are displayed is with back slashes (\) - so you will need to change the direction of each slash. If you use the here package as described in the R projects page the slash direction is not an issue.
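
If you are not using here, base R can also help with slash direction. The sketch below builds a path from its components and converts a Windows-style path; the path itself is a made-up example.

```r
# Build a path from its components; file.path() inserts forward slashes
file.path("data", "provincial.csv")
## [1] "data/provincial.csv"

# A Windows-style path: each back slash must be doubled inside an R string
windows_path <- "C:\\Users\\Name\\data\\March2019.csv"

# Convert the back slashes to forward slashes
fixed_path <- gsub("\\", "/", windows_path, fixed = TRUE)
fixed_path
## [1] "C:/Users/Name/data/March2019.csv"
```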

Relative file paths

We generally recommend providing “relative” file paths instead - that is, the path relative to the root of your R Project. You can do this using the here package as explained in the R projects page. A relative file path might look like this:

# Import csv linelist from the data/clean/linelists/ sub-folders of an R project
linelist <- import(here("data", "clean", "linelists", "marin_country.csv"))

Even if using relative file paths within an R project, you can still use absolute paths to import/export data outside your R project.

3.10 Objects

Everything in R is an object, and R is an “object-oriented” language. These sections will explain:

  • How to create objects (<-)
  • Types of objects (e.g. data frames, vectors..)
  • How to access subparts of objects (e.g. variables in a dataset)
  • Classes of objects (e.g. numeric, logical, integer, double, character, factor)

Everything is an object

This section is adapted from the R4Epis project.
Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - is an object that is assigned a name and can be referenced in later commands.

An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.

Defining objects (<-)

Create objects by assigning them a value with the <- operator.
You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:

object_name <- value (or process/calculation that produces a value)

For example, you may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object current_week is created when it is assigned the value "2018-W10" (the quote marks make this a character value). The object current_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.

See the R commands and their output in the boxes below.

current_week <- "2018-W10"   # this command creates the object current_week by assigning it a value
current_week                 # this command prints the current value of current_week object in the console
## [1] "2018-W10"

NOTE: The [1] in the R console output simply indicates that you are viewing the first item of the output.

CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.

The following command will re-define the value of current_week:

current_week <- "2018-W51"   # assigns a NEW value to the object current_week
current_week                 # prints the current value of current_week in the console
## [1] "2018-W51"

Equals signs =

You will also see equals signs in R code:

  • A double equals sign == between two objects or values asks a logical question: “is this equal to that?”.
  • You will also see equals signs within functions used to specify values of function arguments (read about these in sections below), for example max(age, na.rm = TRUE).
  • You can use a single equals sign = in place of <- to create and define objects, but this is discouraged. You can read about why this is discouraged here.
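The three uses of the equals sign can be seen side by side in the minimal sketch below. The object names and values are invented for illustration only.

```r
# <- creates the object (the recommended assignment operator)
current_week <- "2018-W10"

# == asks a logical question and returns TRUE or FALSE
is_week_10 <- current_week == "2018-W10"
is_week_10

# = inside a function call sets the value of an argument (here, na.rm)
ages <- c(12, 45, NA, 30)
oldest <- max(ages, na.rm = TRUE)
oldest
```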

Datasets

Datasets are also objects (typically “dataframes”) and must be assigned names when they are imported. In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package and its import() function.

# linelist is created and assigned the value of the imported CSV file
linelist <- import("my_linelist.csv")

You can read more about importing and exporting datasets with the section on Import and export.

CAUTION: A quick note on naming of objects:

  • Object names must not contain spaces. Use an underscore (_) or a period (.) instead of a space.
  • Object names are case-sensitive (meaning that Dataset_A is different from dataset_A).
  • Object names must begin with a letter (cannot begin with a number like 1, 2 or 3).
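To check whether a proposed name follows these rules, one option is the base R function make.names(), which returns a "repaired" version of any invalid name. The proposed names below are hypothetical, and this is only a sketch of the idea rather than a required step.

```r
# Three proposed names: one with a space, one starting with a number, one valid
proposed_names <- c("case count", "1st_week", "dataset_A")

# make.names() replaces the space with a period and prefixes the leading number with "X"
valid_names <- make.names(proposed_names)
valid_names

# Names are case-sensitive: these are two different objects
Dataset_A <- 10
dataset_A <- 20
same_value <- Dataset_A == dataset_A
same_value
```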

Outputs

Outputs like tables and plots can be saved as objects, or just printed without being saved. For example, a cross-tabulation of gender and outcome using the base R function table() can be printed directly to the R console (without being saved).

# printed to R console only
table(linelist$gender, linelist$outcome)
##    
##     Death Recover
##   f  1227     953
##   m  1228     950

But the same table can be saved as a named object. Then, optionally, it can be printed.

# save
gen_out_table <- table(linelist$gender, linelist$outcome)

# print
gen_out_table
##    
##     Death Recover
##   f  1227     953
##   m  1228     950

Columns

Columns in a dataset are also objects and can be defined, over-written, and created as described below in the section on Columns.

You can use the assignment operator from base R to create a new column. Below, the new column bmi (Body Mass Index) is created, and for each row the new value is the result of a mathematical operation on the row’s value in the wt_kg and ht_cm columns.

# create new "bmi" column using base R syntax
linelist$bmi <- linelist$wt_kg / (linelist$ht_cm/100)^2

However, in this handbook, we emphasize a different approach to defining columns, which uses the function mutate() from the dplyr package and piping with the pipe operator (%>%). The syntax is easier to read and there are other advantages explained in the page on Cleaning data and core functions. You can read more about piping in the Piping section below.

# create new "bmi" column using dplyr syntax
linelist <- linelist %>% 
  mutate(bmi = wt_kg / (ht_cm/100)^2)

Object structure

Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.

The graphic below is borrowed from this online R tutorial. It shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS basics page.

In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:

Common structure | Explanation | Example
Vectors | A container for a sequence of singular objects, all of the same class (e.g. numeric, character). | “Variables” (columns) in data frames are vectors (e.g. the column age_years).
Data Frames | Vectors (e.g. columns) that are bound together that all have the same number of rows. | linelist is a data frame.

Note that to create a vector that “stands alone” (is not part of a data frame) the function c() is used to combine the different elements. For example, if creating a vector of colors for a plot’s color scale: vector_of_colors <- c("blue", "red2", "orange", "grey")
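The relationship between vectors and data frames can be sketched as follows: two stand-alone vectors of equal length are bound into a small data frame, and a column pulled back out of it is again a vector. The object names and values are invented for this example.

```r
# Two stand-alone vectors of the same length
age_years <- c(5, 12, 31)
gender    <- c("f", "m", "f")

# Bind them together as columns of a data frame
demo_linelist <- data.frame(age_years, gender)

# The data frame has one row per element of the vectors
n_rows <- nrow(demo_linelist)
n_rows

# A column extracted with $ is a vector again
column_is_vector <- is.vector(demo_linelist$age_years)
column_is_vector
```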

Object classes

All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:

Class | Explanation | Examples
Character | These are text/words/sentences “within quotation marks”. Math cannot be done on these objects. | “Character objects are in quotation marks”
Integer | Numbers that are whole only (no decimals) | -5, 14, or 2000
Numeric | These are numbers and can include decimals. If within quotation marks they will be considered character class. | 23.1 or 14
Factor | These are vectors that have a specified order or hierarchy of values | A variable of economic status with ordered values
Date | Once R is told that certain data are Dates, these data can be manipulated and displayed in special ways. See the page on Working with dates for more information. | 2018-04-12 or 15/3/1954 or Wed 4 Jan 1980
Logical | Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks) | TRUE or FALSE
data.frame | A data frame is how R stores a typical dataset. It consists of vectors (columns) of data bound together, that all have the same number of observations (rows). | The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each.
tibble | tibbles are a variation on data frame, the main operational difference being that they print more nicely to the console (display first 10 rows and only columns that fit on the screen) | Any data frame, list, or matrix can be converted to a tibble with as_tibble()
list | A list is like a vector, but it can hold objects of different classes | A list could hold a single number, and a dataframe, and a vector, and even another list within it!

You can test the class of an object by providing its name to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.

class(linelist)         # class should be a data frame or tibble
## [1] "data.frame"
class(linelist$age)     # class should be numeric
## [1] "numeric"
class(linelist$gender)  # class should be character
## [1] "character"

Sometimes, a column will be converted to a different class automatically by R. Watch out for this! For example, if you have a vector or column of numbers, but a character value is inserted… the entire column will change to class character.

num_vector <- c(1,2,3,4,5) # define vector as all numbers
class(num_vector)          # vector is numeric class
## [1] "numeric"
num_vector[3] <- "three"   # convert the third element to a character
class(num_vector)          # vector is now character class
## [1] "character"

One common example of this is when manipulating a data frame in order to print a table - if you make a total row and try to paste/glue together percents in the same cell as numbers (e.g. 23 (40%)), the entire numeric column above will convert to character and can no longer be used for mathematical calculations.

Sometimes, you will need to convert objects or columns to another class.

Function | Action
as.character() | Converts to character class
as.numeric() | Converts to numeric class
as.integer() | Converts to integer class
as.Date() | Converts to Date class - Note: see section on dates for details
factor() | Converts to factor - Note: re-defining order of value levels requires extra arguments

Likewise, there are base R functions to check whether an object IS of a specific class, such as is.numeric(), is.character(), is.double(), is.factor(), is.integer()
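The conversion and checking functions can be tried on small made-up vectors, as in the sketch below (all values are invented). Note how a value that cannot be converted becomes NA, and how as.integer() drops decimals rather than rounding.

```r
# Numbers stored as text, plus one value that is not a number
age_text <- c("23", "14.5", "unknown")
is.numeric(age_text)

# "unknown" cannot be converted and becomes NA (R also gives a warning, silenced here)
age_num <- suppressWarnings(as.numeric(age_text))
age_num

# as.integer() drops the decimal part (it does not round)
age_int <- as.integer(c(2.9, 14))
age_int

# Text in year-month-day format converted to Date class
onset <- as.Date("2018-04-12")
class(onset)

# factor() with the levels = argument sets the order of the values
status <- factor(c("low", "high", "medium"), levels = c("low", "medium", "high"))
levels(status)
```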

Here is more online material on classes and data structures in R.

Columns/Variables ($)

A column in a data frame is technically a “vector” (see table above) - a series of values that must all be the same class (either character, numeric, logical, etc).

A vector can exist independent of a data frame, for example a vector of column names that you want to include as explanatory variables in a model. To create a “stand alone” vector, use the c() function as below:

# define the stand-alone vector of character values
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")

# print the values in this named vector
explanatory_vars
## [1] "gender" "fever"  "chills" "cough"  "aches"  "vomit"

Columns in a data frame are also vectors and can be called, referenced, extracted, or created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. In this handbook, we try to use the word “column” instead of “variable”.

# Retrieve the length of the vector age
length(linelist$age) # (age is a column in the linelist data frame)

By typing the name of the dataframe followed by $ you will also see a drop-down menu of all columns in the data frame. You can scroll through them using your arrow key, select one with your Enter key, and avoid spelling mistakes!

ADVANCED TIP: Some more complex objects (e.g. a list, or an epicontacts object) may have multiple levels which can be accessed through multiple dollar signs. For example epicontacts$linelist$date_onset

Access/index with brackets ([ ])

You may need to view parts of objects, also called “indexing”, which is often done using the square brackets [ ]. Using $ on a dataframe to access a column is also a type of indexing.

my_vector <- c("a", "b", "c", "d", "e", "f")  # define the vector
my_vector[5]                                  # print the 5th element
## [1] "e"

Square brackets also work to return specific parts of a returned output, such as the output of a summary() function:

# All of the summary
summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   16.07   23.00   84.00      86
# Just the second element of the summary, with name (using only single brackets)
summary(linelist$age)[2]
## 1st Qu. 
##       6
# Just the second element, without name (using double brackets)
summary(linelist$age)[[2]]
## [1] 6
# Extract an element by name, without showing the name
summary(linelist$age)[["Median"]]
## [1] 13

Brackets also work on data frames to view specific rows and columns. You can do this using the syntax dataframe[rows, columns]:

# View a specific row (2) from dataset, with all columns (don't forget the comma!)
linelist[2,]

# View all rows, but just one column
linelist[, "date_onset"]

# View values from row 2 and columns 5 through 10
linelist[2, 5:10] 

# View values from row 2 and columns 5 through 10 and 18
linelist[2, c(5:10, 18)] 

# View rows 2 through 20, and specific columns
linelist[2:20, c("date_onset", "outcome", "age")]

# View rows and columns based on criteria
# *** Note the dataframe must still be named in the criteria!
linelist[linelist$age > 25 , c("date_onset", "outcome", "age")]

# Use View() to see the outputs in the RStudio Viewer pane (easier to read) 
# *** Note the capital "V" in View() function
View(linelist[2:20, "date_onset"])

# Save as a new object
new_table <- linelist[2:20, c("date_onset")] 
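One caveat when using criteria inside brackets: if the column being tested contains missing values, base R returns an extra row of NA for each of them. The sketch below uses a small invented data frame to show this, with which() as one possible way to drop those rows.

```r
# A toy data frame with one missing age
demo <- data.frame(id = 1:4, age = c(30, NA, 12, 41))

# The criteria evaluate to TRUE, NA, FALSE, TRUE - the NA produces a row of NAs
with_na_row <- demo[demo$age > 25, ]
with_na_row

# which() keeps only the positions where the criteria are TRUE
without_na_row <- demo[which(demo$age > 25), ]
without_na_row
```

The dplyr function filter() drops rows where the criteria evaluate to NA, which is one reason the bracket and dplyr approaches can return different numbers of rows.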

Note that you can also achieve the above row/column indexing on data frames and tibbles using dplyr syntax (functions filter() for rows, and select() for columns). Read more about these core functions in the Cleaning data and core functions page.

To filter based on “row number”, you can use the dplyr function row_number() with open parentheses as part of a logical filtering statement. Often you will use the %in% operator and a range of numbers as part of that logical statement, as shown below. To see the first N rows, you can also use the base R function head(), which works within a pipe chain.

# View first 100 rows
linelist %>% head(100)

# Show row 5 only
linelist %>% filter(row_number() == 5)

# View rows 2 through 20, and three specific columns (note no quotes necessary on column names)
linelist %>% filter(row_number() %in% 2:20) %>% select(date_onset, outcome, age)

When indexing an object of class list, single brackets always return with class list, even if only a single object is returned. Double brackets, however, can be used to access a single element and return a different class than list.
Brackets can also be written after one another, as demonstrated below.

This visual explanation of list indexing, with pepper shakers, is humorous and helpful.

# define demo list
my_list <- list(
  # First element in the list is a character vector
  hospitals = c("Central", "Empire", "Santa Anna"),
  
  # second element in the list is a data frame of addresses
  addresses   = data.frame(
    street = c("145 Medical Way", "1048 Brown Ave", "999 El Camino"),
    city   = c("Andover", "Hamilton", "El Paso")
    )
  )

Here is how the list looks when printed to the console. See how there are two named elements:

  • hospitals, a character vector
  • addresses, a data frame of addresses

my_list
## $hospitals
## [1] "Central"    "Empire"     "Santa Anna"
## 
## $addresses
##            street     city
## 1 145 Medical Way  Andover
## 2  1048 Brown Ave Hamilton
## 3   999 El Camino  El Paso

Now we extract, using various methods:

my_list[1] # this returns the element in class "list" - the element name is still displayed
## $hospitals
## [1] "Central"    "Empire"     "Santa Anna"
my_list[[1]] # this returns only the (unnamed) character vector
## [1] "Central"    "Empire"     "Santa Anna"
my_list[["hospitals"]] # you can also index by name of the list element
## [1] "Central"    "Empire"     "Santa Anna"
my_list[[1]][3] # this returns the third element of the "hospitals" character vector
## [1] "Santa Anna"
my_list[[2]][1] # This returns the first column ("street") of the address data frame
##            street
## 1 145 Medical Way
## 2  1048 Brown Ave
## 3   999 El Camino
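To confirm what class each method returns, you can wrap the same extractions in class(). The sketch below re-creates the demo list so that it runs on its own; stringsAsFactors = FALSE is set explicitly here as an assumption, so the text columns stay character in older versions of R too.

```r
# Re-create the demo list
my_list <- list(
  hospitals = c("Central", "Empire", "Santa Anna"),
  addresses = data.frame(
    street = c("145 Medical Way", "1048 Brown Ave", "999 El Camino"),
    city   = c("Andover", "Hamilton", "El Paso"),
    stringsAsFactors = FALSE
  )
)

class(my_list[1])        # single brackets: still a list
class(my_list[[1]])      # double brackets: the character vector itself
class(my_list[[2]][1])   # single brackets on the data frame: still a data frame

# Dollar signs and brackets can be chained to reach a single value
second_city <- my_list$addresses$city[2]
second_city
```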

Remove objects

You can remove individual objects from your R environment by putting the name in the rm() function (no quote marks):

rm(object_name)

You can remove all objects (clear your workspace) by running:

rm(list = ls(all = TRUE))

3.11 Piping (%>%)

Two general approaches to working with objects are:

  1. Pipes/tidyverse - pipes send an object from function to function - emphasis is on the action, not the object
  2. Define intermediate objects - an object is re-defined again and again - emphasis is on the object

Pipes

Simply explained, the pipe operator (%>%) passes an intermediate output from one function to the next.
You can think of it as saying “then”. Many functions can be linked together with %>%.

  • Piping emphasizes a sequence of actions, not the object the actions are being performed on
  • Pipes are best when a sequence of actions must be performed on one object
  • Pipes come from the package magrittr, which is automatically included in packages dplyr and tidyverse
  • Pipes can make code cleaner, easier to read, and more intuitive

Read more on this approach in the tidyverse style guide

Here is a fake example for comparison, using fictional functions to “bake a cake”. First, the pipe method:

# A fake example of how to bake a cake using piping syntax

cake <- flour %>%       # to define cake, start with flour, and then...
  add(eggs) %>%   # add eggs
  add(oil) %>%    # add oil
  add(water) %>%  # add water
  mix_together(         # mix together
    utensil = spoon,
    minutes = 2) %>%    
  bake(degrees = 350,   # bake
       system = "fahrenheit",
       minutes = 35) %>%  
  let_cool()            # let it cool down

Here is another link describing the utility of pipes.

The %>% pipe is not a base R function. To use it, the magrittr package must be installed and loaded (this is typically done by loading the tidyverse or dplyr package, which include it). You can read more about piping in the magrittr documentation.
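As a runnable counterpart to the fictional cake example, here is a minimal sketch on a small invented data frame. It assumes the dplyr package is installed, and the object and column names are made up for illustration. The same result is also produced with nested functions, to show that the pipe only changes how the code is written.

```r
library(dplyr)   # loading dplyr also makes the %>% pipe available

# A small invented linelist
demo_linelist <- data.frame(
  case_id = 1:6,
  age     = c(3, 27, 45, 8, 61, 19),
  outcome = c("Recover", "Death", "Recover", "Recover", "Death", "Recover")
)

# Piped: start with the data, then keep adults, then count outcomes
adult_outcomes <- demo_linelist %>%
  filter(age >= 18) %>%
  count(outcome)
adult_outcomes

# Nested: the same steps, read from the inside out
adult_outcomes_nested <- count(filter(demo_linelist, age >= 18), outcome)
```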

Note that just like other R commands, pipes can be used to just display the result, or to save/re-save an object, depending on whether the assignment operator <- is involved. See both below:

# Create or overwrite object, defining as aggregate counts by age category (not printed)
linelist_summary <- linelist %>% 
  count(age_cat)
# Print the table of counts in the console, but don't save it
linelist %>% 
  count(age_cat)
##   age_cat    n
## 1     0-4 1095
## 2     5-9 1095
## 3   10-14  941
## 4   15-19  743
## 5   20-29 1073
## 6   30-49  754
## 7   50-69   95
## 8     70+    6
## 9    <NA>   86

%<>%

This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand: the two commands below are equivalent:

linelist <- linelist %>%
  filter(age > 50)

linelist %<>% filter(age > 50)
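The equivalence can be checked on a simple vector, as sketched below with invented values. Note that %<>% is not made available by loading dplyr alone, so this sketch loads magrittr directly.

```r
library(magrittr)   # provides both %>% and %<>%

ages <- c(64, 12, 51, 70, 33)

# Standard approach: pipe forward, then assign back to the same name
ages_a <- ages
ages_a <- ages_a %>% sort()

# Assignment pipe: pipe forward and re-define in one step
ages_b <- ages
ages_b %<>% sort()

identical(ages_a, ages_b)
```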

Define intermediate objects

This approach to changing objects/dataframes may be better if:

  • You need to manipulate multiple objects
  • There are intermediate steps that are meaningful and deserve separate object names

Risks:

  • Creating new objects for each step means creating lots of objects. If you use the wrong one you might not realize it!
  • Naming all the objects can be confusing
  • Errors may not be easily detectable

Either name each intermediate object, or overwrite the original, or combine all the functions together. All come with their own risks.

Below is the same fake “cake” example as above, but using this style:

# a fake example of how to bake a cake using this method (defining intermediate objects)
batter_1 <- add(flour, eggs)
batter_2 <- add(batter_1, oil)
batter_3 <- add(batter_2, water)

batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)

cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)

cake <- let_cool(cake)

Combine all functions together - this is difficult to read:

# an example of combining/nesting multiple functions together - difficult to read
cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))

3.12 Key operators and functions

This section details operators in R, such as:

  • Definitional operators
  • Relational operators (less than, equal to…)
  • Logical operators (and, or…)
  • Handling missing values
  • Mathematical operators and functions (+, -, sum(), median(), …)
  • The %in% operator

Assignment operators

<-

The basic assignment operator in R is <-, used as object_name <- value.
This assignment operator can also be written as =. We advise use of <- for general R use.
We also advise surrounding such operators with spaces, for readability.

<<-

If writing functions, or using R in an interactive way with sourced scripts, then you may need to use the assignment operator <<- (from base R). This operator is used to define an object in a higher ‘parent’ R Environment. See this online reference.
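A minimal sketch of <<- is below: a function that updates a counter that was defined outside of the function. The names case_counter and register_case() are hypothetical and exist only for this illustration.

```r
# An object defined outside of any function
case_counter <- 0

# Inside the function, <<- re-defines the existing object in the parent environment.
# A regular <- here would only create a temporary copy inside the function.
register_case <- function() {
  case_counter <<- case_counter + 1
  invisible(case_counter)
}

register_case()
register_case()

case_counter   # the object outside the function has been changed
```

Because it changes objects outside the function, <<- can make code harder to follow, so it is usually reserved for cases where it is really needed.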

%<>%

This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, as shown below in two equivalent examples:

linelist <- linelist %>% 
  mutate(age_months = age_years * 12)

The above is equivalent to the below:

linelist %<>% mutate(age_months = age_years * 12)

%<+%

This is used to add data to phylogenetic trees with the ggtree package. See the page on Phylogenetic trees or this online resource book.

Relational and logical operators

Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:

Meaning | Operator | Example | Example Result
Equal to | == | "A" == "a" | FALSE (because R is case sensitive). Note that == (double equals) is different from = (single equals), which acts like the assignment operator <-
Not equal to | != | 2 != 0 | TRUE
Greater than | > | 4 > 2 | TRUE
Less than | < | 4 < 2 | FALSE
Greater than or equal to | >= | 6 >= 4 | TRUE
Less than or equal to | <= | 6 <= 4 | FALSE
Value is missing | is.na() | is.na(7) | FALSE (see page on Missing data)
Value is not missing | !is.na() | !is.na(7) | TRUE

Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.

Meaning | Operator
AND | &
OR | | (vertical bar)
Parentheses | ( ) - used to group criteria together and clarify order of operations
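Relational and logical operators work element-by-element on vectors, which is how they are applied to whole columns. The sketch below uses an invented vector of ages that includes one missing value, to show how NA carries through a comparison.

```r
age <- c(4, 17, 25, NA, 62)

# A relational operator returns one TRUE/FALSE (or NA) per element
is_adult <- age >= 18
is_adult

# OR: under 5 years or over 60 years
young_or_old <- age < 5 | age > 60
young_or_old

# AND, combined with !is.na() so that missing ages return FALSE instead of NA
known_adult <- !is.na(age) & age >= 18
known_adult
```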

For example, below, we have a linelist with two variables we want to use to create our case definition: rdt_result, a test result, and other_cases_in_home, which tells us if there are other cases in the household. The command below uses the function case_when() to create the new variable case_def:

linelist_cleaned <- linelist %>%
  mutate(case_def = case_when(
    is.na(rdt_result) & is.na(other_cases_in_home)           ~ NA_character_,
    rdt_result == "Positive"                                 ~ "Confirmed",
    rdt_result != "Positive" & other_cases_in_home == "Yes"  ~ "Probable",
    TRUE                                                     ~ "Suspected"
  ))

The criteria are evaluated in order, such that:

Criteria in example above | Resulting value in new variable “case_def”
If the values for both rdt_result and other_cases_in_home are missing | NA (missing)
If the value in rdt_result is “Positive” | “Confirmed”
If the value in rdt_result is NOT “Positive” AND the value in other_cases_in_home is “Yes” | “Probable”
If none of the above criteria are met | “Suspected”

Note that R is case-sensitive, so “Positive” is different than “positive”…
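To see how such criteria behave row by row, the sketch below applies a case definition of this kind to a small invented data frame, assuming the dplyr package is installed. The last row is worth noting: the test result is missing but another household case is reported, so neither the "Confirmed" nor the "Probable" criteria evaluate to TRUE and the row falls through to "Suspected".

```r
library(dplyr)

# Invented data covering each combination of interest
demo_cases <- data.frame(
  rdt_result          = c("Positive", "Negative", "Negative", NA, NA),
  other_cases_in_home = c("No", "Yes", "No", NA, "Yes")
)

demo_cases <- demo_cases %>%
  mutate(case_def = case_when(
    is.na(rdt_result) & is.na(other_cases_in_home)           ~ NA_character_,
    rdt_result == "Positive"                                 ~ "Confirmed",
    rdt_result != "Positive" & other_cases_in_home == "Yes"  ~ "Probable",
    TRUE                                                     ~ "Suspected"
  ))

demo_cases$case_def
```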

Missing values

In R, missing values are represented by the special value NA (a “reserved” value) (capital letters N and A - not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”, or .), you may want to re-code those values to NA. How to do this is addressed in the Import and export page.

To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.

rdt_result <- c("Positive", "Suspected", "Positive", NA)   # two positive cases, one suspected, and one unknown
is.na(rdt_result)  # Tests whether the value of rdt_result is NA
## [1] FALSE FALSE FALSE  TRUE
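If missing values were recorded with a placeholder such as 99, one simple base R approach is sketched below on an invented vector. This is only one option; the pages on Import and export and Missing data describe others.

```r
# Ages where 99 was used to record "unknown"
age_raw <- c(34, 99, 12, 99, 56)

# Copy the vector, then replace every 99 with NA
age_clean <- age_raw
age_clean[age_clean == 99] <- NA
age_clean

# Count the missing values
n_missing <- sum(is.na(age_clean))
n_missing

# The placeholder no longer distorts calculations
mean(age_raw)                     # includes the 99s
mean_age <- mean(age_clean, na.rm = TRUE)
mean_age
```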

Read more about missing, infinite, NULL, and impossible values in the page on Missing data. Learn how to convert missing values when importing data in the page on Import and export.

Mathematics and statistics

Unless noted otherwise (for example janitor::round_half_up()), the operators and functions in this section are automatically available using base R.

Mathematical operators

These are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.

Purpose | Example in R
addition | 2 + 3
subtraction | 2 - 3
multiplication | 2 * 3
division | 30 / 5
exponent | 2^3
order of operations | ( )

Mathematical functions

Purpose | Function
rounding | round(x, digits = n)
rounding (halves always up) | janitor::round_half_up(x, digits = n)
ceiling (round up) | ceiling(x)
floor (round down) | floor(x)
absolute value | abs(x)
square root | sqrt(x)
exponent | exp(x)
natural logarithm | log(x)
log base 10 | log10(x)
log base 2 | log2(x)

Note: for round() the digits = specifies the number of decimal places. Use signif() to round to a number of significant figures.
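A few of these base R functions applied to simple numbers, as a quick sketch (the values are arbitrary):

```r
rounded     <- round(3.14159, digits = 2)   # 2 decimal places
significant <- signif(123456, digits = 2)   # 2 significant figures
up          <- ceiling(2.1)                 # always rounds up
down        <- floor(2.9)                   # always rounds down
root        <- sqrt(16)
log_base_10 <- log10(1000)
natural_log <- log(exp(1))                  # log() is the inverse of exp()

c(rounded, significant, up, down, root, log_base_10, natural_log)
```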

Scientific notation

The likelihood of scientific notation being used depends on the value of the scipen option.

From the documentation of ?options: scipen is a penalty to be applied when deciding to print numeric values in fixed or exponential notation. Positive values bias towards fixed and negative towards scientific notation: fixed notation will be preferred unless it is more than ‘scipen’ digits wider.

If it is set to a low number (e.g. 0, the default), R switches to scientific notation whenever that is the shorter way to print a number. To “turn off” scientific notation in your R session, set it to a very high number, for example:

# turn off scientific notation
options(scipen=999)
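The effect of the option can be seen by formatting the same number under both settings. This sketch stores the existing options first and restores them at the end, so that it does not permanently change your session.

```r
# Set scipen to the default (0), keeping a copy of the previous setting
old_opts <- options(scipen = 0)
sci_text <- format(100000)      # scientific notation is shorter here, so it is used
sci_text

# With a very high scipen, fixed notation is used instead
options(scipen = 999)
fixed_text <- format(100000)
fixed_text

# Restore the setting that was in place before this example
options(old_opts)
```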

Rounding

DANGER: round() uses “banker’s rounding” which rounds up from a .5 only if the upper number is even. Use round_half_up() from janitor to consistently round halves up to the nearest whole number. See this explanation

# use the appropriate rounding function for your work
round(c(2.5, 3.5))
## [1] 2 4
janitor::round_half_up(c(2.5, 3.5))
## [1] 3 4

Statistical functions

CAUTION: The functions below will by default include missing values in calculations. Missing values will result in an output of NA, unless the argument na.rm = TRUE is specified. This can be written shorthand as na.rm = T.

Objective | Function
mean (average) | mean(x, na.rm=T)
median | median(x, na.rm=T)
standard deviation | sd(x, na.rm=T)
quantiles* | quantile(x, probs)
sum | sum(x, na.rm=T)
minimum value | min(x, na.rm=T)
maximum value | max(x, na.rm=T)
range of numeric values | range(x, na.rm=T)
summary** | summary(x)

Notes:

  • *quantile(): x is the numeric vector to examine, and probs = is a numeric vector with probabilities between 0 and 1.0, e.g. c(0.5, 0.8, 0.85)
  • **summary(): gives a summary on a numeric vector including mean, median, and common percentiles
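The effect of na.rm = TRUE can be seen on a short invented vector with one missing value. Note that quantile() also needs na.rm = TRUE when missing values are present.

```r
x <- c(2, 4, 6, 8, NA)

# Without na.rm = TRUE, the missing value makes the result NA
mean_with_na <- mean(x)
mean_with_na

# With na.rm = TRUE, the missing value is removed before calculating
mean_x   <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)
range_x  <- range(x, na.rm = TRUE)

# 25th and 75th percentiles
quartiles <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
quartiles

# summary() reports the number of NA values separately
summary(x)
```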

DANGER: If providing a vector of numbers to one of the above functions, be sure to wrap the numbers within c() .

# If supplying raw numbers to a function, wrap them in c()
mean(1, 6, 12, 10, 5, 0)    # !!! INCORRECT !!!  
## [1] 1
mean(c(1, 6, 12, 10, 5, 0)) # CORRECT
## [1] 5.666667

Other useful functions

Objective | Function | Example
create a sequence | seq(from, to, by) | seq(1, 10, 2)
repeat x, n times | rep(x, ntimes) | rep(1:3, 2) or rep(c("a", "b", "c"), 3)
subdivide a numeric vector | cut(x, n) | cut(linelist$age, 5)
take a random sample | sample(x, size) | sample(linelist$id, size = 5, replace = TRUE)
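These functions can be tried without any dataset, as in the sketch below. The age values and break points are invented; here cut() is given explicit break points through its breaks = argument rather than a number of intervals, and set.seed() is used only to make the random sample repeatable.

```r
# Every second number from 1 to 10
odd_numbers <- seq(1, 10, 2)
odd_numbers

# Repeat the whole vector twice
repeated <- rep(1:3, 2)
repeated

# Divide ages into categories using explicit break points
ages <- c(1, 15, 40, 70)
age_groups <- cut(ages, breaks = c(0, 5, 18, 65, Inf))
age_groups

# Draw 5 values at random, allowing the same value to be drawn more than once
set.seed(1)
sampled <- sample(c("a", "b", "c"), size = 5, replace = TRUE)
sampled
```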

%in%

A very useful operator for matching values, and for quickly assessing if a value is within a vector or dataframe.

my_vector <- c("a", "b", "c", "d")
"a" %in% my_vector
## [1] TRUE
"h" %in% my_vector
## [1] FALSE

To ask if a value is not %in% a vector, put an exclamation mark (!) in front of the logic statement:

# to negate, put an exclamation in front
!"a" %in% my_vector
## [1] FALSE
!"h" %in% my_vector
## [1] TRUE
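%in% also works when the left side holds several values: it returns one TRUE or FALSE for each of them. A short sketch with invented values, including how this can be used to keep or drop elements:

```r
my_vector <- c("a", "b", "c", "d")
to_check  <- c("a", "h", "d")

# One result per element of to_check
found <- to_check %in% my_vector
found

# Keep only the values that are NOT in my_vector
not_found <- to_check[!to_check %in% my_vector]
not_found
```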

%in% is very useful when using the dplyr function case_when(). You can define a vector previously, and then reference it later. For example:

affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")

linelist <- linelist %>% 
  mutate(child_hospitaled = case_when(
    hospitalized %in% affirmative & age < 18 ~ "Hospitalized Child",
    TRUE                                      ~ "Not"))

Note: If you want to detect a partial string, perhaps using str_detect() from stringr, it will not accept a character vector like c("1", "Yes", "yes", "y"). Instead, it must be given a regular expression - one condensed string with OR bars, such as “1|Yes|yes|y”. For example, str_detect(hospitalized, "1|Yes|yes|y"). See the page on Characters and strings for more information.

You can convert a character vector to a single regular expression string with this command:

affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")
affirmative
## [1] "1"   "Yes" "YES" "yes" "y"   "Y"   "oui" "Oui" "Si"
# condense to a single string, with the values separated by OR bars
affirmative_str_search <- paste0(affirmative, collapse = "|")  # option with base R
affirmative_str_search <- str_c(affirmative, collapse = "|")   # option with stringr package

affirmative_str_search
## [1] "1|Yes|YES|yes|y|Y|oui|Oui|Si"
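One thing to watch for when searching with such a pattern: it matches any text that contains one of the values, not only exact matches. The sketch below uses the base R function grepl() as a stand-in for str_detect(), with invented responses, and shows one possible way to require an exact match by anchoring the pattern with ^ and $.

```r
affirmative <- c("1", "Yes", "yes", "y")
pattern <- paste0(affirmative, collapse = "|")

responses <- c("Yes", "no", "maybe", "y")

# Partial matching: "maybe" is detected because it contains the letter "y"
partial_match <- grepl(pattern, responses)
partial_match

# Anchored pattern: the whole response must equal one of the values
anchored_pattern <- paste0("^(", pattern, ")$")
exact_match <- grepl(anchored_pattern, responses)
exact_match
```

When you only need exact matches, %in% with the original character vector is usually the simpler choice.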

3.13 Errors & warnings

This section explains:

  • The difference between errors and warnings
  • General syntax tips for writing R code
  • Code assists

Common errors and warnings and troubleshooting tips can be found in the page on Errors and help.

Error versus Warning

When a command is run, the R Console may show you warning or error messages in red text.

  • A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.

  • An error means that R was not able to complete your command.
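The difference can be sketched with two small commands. In this example the warning is silenced with suppressWarnings() and the error is captured with tryCatch(), only so that the example runs from start to finish. Typed directly in the console, the first command would show a warning and the second would stop with an error.

```r
# A warning: R completes the command, but "ten" cannot be converted and becomes NA
converted <- suppressWarnings(as.numeric(c("12", "ten")))
converted

# An error: text cannot be summed, so R cannot complete the command.
# tryCatch() returns the replacement text instead of stopping the script.
outcome <- tryCatch(
  sum("a"),
  error = function(e) "stopped with an error"
)
outcome
```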

Look for clues:

  • The error/warning message will often include a line number for the problem.

  • If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.

If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!

General syntax tips

A few things to remember when writing commands in R, to avoid errors and warnings:

  • Always close parentheses - tip: count the number of opening “(” and closing parentheses “)” for each code chunk
  • Avoid spaces in column and object names. Use underscore ( _ ) or periods ( . ) instead
  • Keep track of and remember to separate a function’s arguments with commas
  • R is case-sensitive, meaning Variable_A is different from variable_A

Code assists

Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parenthesis, RStudio will raise a flag on that line, on the right side of the script, to warn you.

4 Transition to R

Below, we provide some advice and resources if you are transitioning to R.

R was introduced in the late 1990s and has since grown dramatically in scope. Its capabilities are so extensive that commercial alternatives have reacted to R developments in order to stay competitive! (read this article comparing R, SPSS, SAS, STATA, and Python).

Moreover, R is much easier to learn than it was 10 years ago. Previously, R had a reputation of being difficult for beginners. It is now much easier with friendly user-interfaces like RStudio, intuitive code like the tidyverse, and many tutorial resources.

Do not be intimidated - come discover the world of R!

4.1 From Excel

Transitioning from Excel directly to R is a very achievable goal. It may seem daunting, but you can do it!

It is true that someone with strong Excel skills can do very advanced activities in Excel alone - even using scripting tools like VBA. Excel is used across the world and is an essential tool for an epidemiologist. However, complementing it with R can dramatically improve and expand your work flows.

Benefits

You will find that using R offers immense benefits in time saved, more consistent and accurate analysis, reproducibility, shareability, and faster error-correction. Like any new software, there is a learning “curve” of time you must invest to become familiar. The dividends will be significant, and an immense scope of new possibilities will open to you with R.

Excel is a well-known software that can be easy for a beginner to use to produce simple analysis and visualizations with “point-and-click”. In comparison, it can take a couple weeks to become comfortable with R functions and interface. However, R has evolved in recent years to become much more friendly to beginners.

Many Excel workflows rely on memory and on repetition - thus, there is much opportunity for error. Furthermore, generally the data cleaning, analysis methodology, and equations used are hidden from view. It can require substantial time for a new colleague to learn what an Excel workbook is doing and how to troubleshoot it. With R, all the steps are explicitly written in the script and can be easily viewed, edited, corrected, and applied to other datasets.

To begin your transition from Excel to R you must adjust your mindset in a few important ways:

Tidy data

Use machine-readable “tidy” data instead of messy “human-readable” data. These are the three main requirements for “tidy” data, as explained in this tutorial on “tidy” data in R:

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell

To Excel users - think of the role that Excel “tables” play in standardizing data and making the format more predictable.

An example of “tidy” data would be the case linelist used throughout this handbook - each variable is contained within one column, each observation (one case) has its own row, and every value is in just one cell. Below you can view the first 50 rows of the linelist:

The main reason one encounters non-tidy data is because many Excel spreadsheets are designed to prioritize easy reading by humans, not easy reading by machines/software.

To help you see the difference, below are some fictional examples of non-tidy data that prioritize human-readability over machine-readability:

Problems: In the spreadsheet above, there are merged cells which are not easily digested by R. Which row should be considered the “header” is not clear. A color-based dictionary is to the right side and cell values are represented by colors - which is also not easily interpreted by R (nor by humans with color-blindness!). Furthermore, different pieces of information are combined into one cell (multiple partner organizations working in one area, or the status “TBC” in the same cell as “Partner D”).

Problems: In the spreadsheet above, there are numerous extra empty rows and columns within the dataset - this will cause cleaning headaches in R. Furthermore, the GPS coordinates are spread across two rows for a given treatment center. As a side note - the GPS coordinates are in two different formats!

“Tidy” datasets may not be as readable to a human eye, but they make data cleaning and analysis much easier! Tidy data can be stored in various formats, for example “long” or “wide” (see page on Pivoting data), but the principles above are still observed.
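
To make the idea concrete, here is a minimal sketch in base R of tidying one of the problems described above: several values combined in one cell. The data frame, its column names, and the separator “;” are all invented for this illustration.

```r
# Hypothetical "human-readable" table: several partners share one cell
messy <- data.frame(
  district = c("North", "South"),
  partners = c("Partner A; Partner B", "Partner C"),
  stringsAsFactors = FALSE
)

# Split the combined cell so that each value gets its own cell
partner_list <- strsplit(messy$partners, ";\\s*")

# Build the tidy version: one row per district-partner pair
tidy <- data.frame(
  district = rep(messy$district, times = lengths(partner_list)),
  partner  = unlist(partner_list),
  stringsAsFactors = FALSE
)

tidy
```

The tidy version has more rows and is less compact to read, but each row is now one observation and each cell holds one value.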

Functions

The R word “function” might be new, but the concept exists in Excel too as formulas. Formulas in Excel also require precise syntax (e.g. placement of semicolons and parentheses). All you need to do is learn a few new functions and how they work together in R.
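
As a rough side-by-side, the sketch below pairs a few familiar Excel formulas with base R functions that do a similar job. The vector of ages is made up, and the Excel cell ranges in the comments are only for orientation.

```r
# Hypothetical ages; NA marks a missing value (like an empty Excel cell)
ages <- c(30, 40, NA, 50)

# Excel: =AVERAGE(A1:A4)          R: mean(), told to ignore missing values
mean_age <- mean(ages, na.rm = TRUE)

# Excel: =COUNTIF(A1:A4, ">=40")  R: sum() over a logical test
n_40_plus <- sum(ages >= 40, na.rm = TRUE)

# Functions can be nested, just like Excel formulas
max_rounded <- round(max(ages, na.rm = TRUE) / 7, digits = 2)
```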

Scripts

Instead of clicking buttons and dragging cells you will be writing every step and procedure into a “script”. Excel users may be familiar with “VBA macros” which also employ a scripting approach.

The R script consists of step-by-step instructions. This allows any colleague to read the script and easily see the steps you took. This also helps de-bug errors or inaccurate calculations. See the R basics section on scripts for examples.

Here is an example of an R script:
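
A minimal sketch of what such a script might look like is given here. It uses only base R and a small invented dataset, and the step comments are just one possible way to organise a script.

```r
# ------------------------------------------------------------
# Purpose: summarise outcomes in a small case linelist
# Author:  (your name)
# ------------------------------------------------------------

# Step 1: create the data (a real script would import a file here)
cases <- data.frame(
  case_id = 1:5,
  age     = c(4, 25, 67, 40, 15),
  outcome = c("recovered", "death", "death", "recovered", "recovered")
)

# Step 2: clean - add an age group column
cases$age_group <- ifelse(cases$age < 18, "child", "adult")

# Step 3: analyse - count outcomes within each age group
outcome_table <- table(cases$age_group, cases$outcome)

# Step 4: show the result
print(outcome_table)
```

Because every step is written down, a colleague can re-run the script from top to bottom and obtain the same table.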

Excel-to-R resources

Here are some links to tutorials to help you transition to R from Excel:

R-Excel interaction

R has robust ways to import Excel workbooks, work with the data, export/save Excel files, and work with the nuances of Excel sheets.

It is true that some of the more aesthetic Excel formatting can get lost in translation (e.g. italics, sideways text, etc.). If your workflow requires passing documents back-and-forth between R and Excel while retaining the original Excel formatting, try packages such as openxlsx.

4.2 From Stata

Coming to R from Stata

Many epidemiologists are first taught how to use Stata, and it can seem daunting to move into R. However, if you are a comfortable Stata user then the jump into R is certainly more manageable than you might think. While there are some key differences between Stata and R in how data can be created and modified, as well as how analysis functions are implemented – after learning these key differences you will be able to translate your skills.

Below are some key translations between Stata and R, which may be handy as you review this guide.

General notes

STATA | R
You can only view and manipulate one dataset at a time | You can view and manipulate multiple datasets at the same time, therefore you will frequently have to specify your dataset within the code
Online community available through https://www.statalist.org/ | Online community available through RStudio, Stack Overflow, and R-bloggers
Point and click functionality as an option | Minimal point and click functionality
Help for commands available by help [command] | Help available by ?[function] or search in the Help pane
Comment code using * or /// or /* TEXT */ | Comment code using #
Almost all commands are built-in to Stata. New/user-written functions can be installed as ado files using ssc install [package] | R installs with base functions, but typical use involves installing other packages from CRAN (see page on R basics)
Analysis is usually written in a do file | Analysis written in an R script in the RStudio source pane. R markdown scripts are an alternative.

Working directory

STATA | R
Working directories involve absolute filepaths (e.g. “C:/username/documents/projects/data/”) | Working directories can be either absolute, or relative to a project root folder by using the here package (see Import and export)
See current working directory with pwd | Use getwd() or here() (if using the here package), with empty parentheses
Set working directory with cd “folder location” | Use setwd("folder location"), or set_here("folder location") (if using the here package)
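
The base R equivalents of pwd and cd can be tried with the short sketch below. It switches to R's temporary folder purely for demonstration and then switches back, so it should not disturb your own files.

```r
# Stata: pwd   ->   R: getwd()
original_wd <- getwd()

# Stata: cd "folder location"   ->   R: setwd("folder location")
# tempdir() is used here only as a folder that is guaranteed to exist
setwd(tempdir())
temp_wd <- getwd()

# Return to where we started
setwd(original_wd)
```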

Importing and viewing data

STATA | R
Specific commands per file type | Use import() from rio package for almost all filetypes. Specific functions exist as alternatives (see Import and export)
Reading in csv files is done by import delimited “filename.csv” | Use import("filename.csv")
Reading in xlsx files is done by import excel “filename.xlsx” | Use import("filename.xlsx")
Browse your data in a new window using the command browse | View a dataset in the RStudio source pane using View(dataset). You need to specify your dataset name to the function in R because multiple datasets can be held at the same time. Note capital “V” in this function
Get a high-level overview of your dataset using summarize, which provides the variable names and basic information | Get a high-level overview of your dataset using summary(dataset)

Basic data manipulation

STATA | R
Dataset columns are often referred to as “variables” | More often referred to as “columns” or sometimes as “vectors” or “variables”
No need to specify the dataset | In each of the below commands, you need to specify the dataset - see the page on Cleaning data and core functions for examples
New variables are created using the command generate varname = | Generate new variables using the function mutate(varname = ). See page on Cleaning data and core functions for details on all the below dplyr functions.
Variables are renamed using rename old_name new_name | Columns can be renamed using the function rename(new_name = old_name)
Variables are dropped using drop varname | Columns can be removed using the function select() with the column name in the parentheses following a minus sign
Factor variables can be labeled using a series of commands such as label define | Labeling values can be done by converting the column to Factor class and specifying levels. See page on Factors. Column names are not typically labeled as they are in Stata.
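
The dplyr translations in the table above can be sketched as follows. This assumes the dplyr package is installed; the mini-linelist and its column names are invented, and the Stata commands in the comments are approximate equivalents.

```r
library(dplyr)

# Hypothetical mini-linelist
linelist <- data.frame(
  case_id = 1:4,
  age_yrs = c(5, 34, 61, 19),
  sex     = c("f", "m", "f", "m"),
  notes   = c("", "", "", "")
)

# Note that the dataset is named explicitly at the start of the chain
linelist_clean <- linelist %>%
  mutate(adult = age_yrs >= 18) %>%   # Stata: generate adult = age_yrs >= 18
  rename(age = age_yrs) %>%           # Stata: rename age_yrs age
  select(-notes)                      # Stata: drop notes
```

The original linelist is left unchanged; the result is saved as a new object, which is one reason R can hold several datasets at once.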

Descriptive analysis

STATA | R
Tabulate counts of a variable using tab varname | Provide the dataset and column name to table() such as table(dataset$colname). Alternatively, use count(varname) from the dplyr package, as explained in Grouping data
Cross-tabulation of two variables in a 2x2 table is done with tab varname1 varname2 | Use table(dataset$varname1, dataset$varname2) or count(varname1, varname2)
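
A minimal base R sketch of the two tabulations, using a small made-up dataset:

```r
# Hypothetical dataset
dataset <- data.frame(
  sex     = c("f", "m", "f", "f"),
  outcome = c("recovered", "death", "recovered", "death")
)

# Stata: tab sex
sex_counts <- table(dataset$sex)

# Stata: tab sex outcome
cross_tab <- table(dataset$sex, dataset$outcome)

sex_counts
cross_tab
```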

While this list gives an overview of the basics in translating Stata commands into R, it is not exhaustive. There are many other great resources for Stata users transitioning to R that could be of interest:

4.3 From SAS

Coming from SAS to R

SAS is commonly used at public health agencies and in academic research fields. Although transitioning to a new language is rarely a simple process, understanding key differences between SAS and R may help you start to navigate the new language using your native language. The tables below outline the key translations in data management and descriptive analysis between SAS and R.

General notes

SAS | R
Online community available through SAS Customer Support | Online community available through RStudio, Stack Overflow, and R-bloggers
Help for commands available by help [command] | Help available by ?[function] or search in the Help pane
Comment code using * TEXT ; or /* TEXT */ | Comment code using #
Almost all commands are built-in. Users can write new functions using SAS macro, SAS/IML, SAS Component Language (SCL), and most recently, procedures Proc Fcmp and Proc Proto | R installs with base functions, but typical use involves installing other packages from CRAN (see page on R basics)
Analysis is usually conducted by writing a SAS program in the Editor window. | Analysis written in an R script in the RStudio source pane. R markdown scripts are an alternative.

Working directory

SAS | R
Working directories can be either absolute, or relative to a project root folder by defining the root folder using %let rootdir=/root path; %include “&rootdir/subfoldername/filename” | Working directories can be either absolute, or relative to a project root folder by using the here package (see Import and export)
See current working directory with %put %sysfunc(getoption(work)); | Use getwd() or here() (if using the here package), with empty parentheses
Set working directory with libname “folder location” | Use setwd("folder location"), or set_here("folder location") if using the here package

Importing and viewing data

SAS | R
Use Proc Import procedure or using Data Step Infile statement. | Use import() from rio package for almost all filetypes. Specific functions exist as alternatives (see Import and export)
Reading in csv files is done by using Proc Import datafile="filename.csv" out=work.filename dbms=CSV; run; OR using Data Step Infile statement | Use import("filename.csv")
Reading in xlsx files is done by using Proc Import datafile="filename.xlsx" out=work.filename dbms=xlsx; run; OR using Data Step Infile statement | Use import("filename.xlsx")
Browse your data in a new window by opening the Explorer window and selecting the desired library and dataset | View a dataset in the RStudio source pane using View(dataset). You need to specify your dataset name to the function in R because multiple datasets can be held at the same time. Note capital “V” in this function

Basic data manipulation

SAS | R
Dataset columns are often referred to as “variables” | More often referred to as “columns” or sometimes as “vectors” or “variables”
No special procedures are needed to create a variable. New variables are created simply by typing the new variable name, followed by an equal sign, and then an expression for the value | Generate new variables using the function mutate(). See page on Cleaning data and core functions for details on all the below dplyr functions.
Variables are renamed using rename old_name=new_name | Columns can be renamed using the function rename(new_name = old_name)
Variables are kept using keep=varname | Columns can be selected using the function select() with the column name in the parentheses
Variables are dropped using drop=varname | Columns can be removed using the function select() with the column name in the parentheses following a minus sign
Factor variables can be labeled in the Data Step using Label statement | Labeling values can be done by converting the column to Factor class and specifying levels. See page on Factors. Column names are not typically labeled.
Records are selected using Where or If statement in the Data Step. Multiple selection conditions are separated using “and” command. | Records are selected using the function filter() with multiple selection conditions separated either by an AND operator (&) or a comma
Datasets are combined using Merge statement in the Data Step. The datasets to be merged need to be sorted first using Proc Sort procedure. | dplyr package offers a few functions for merging datasets. See page Joining Data for details.
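
The last two rows of the table can be sketched as below, assuming the dplyr package is installed. The two small data frames and the choice of left_join() as the joining function are ours, made only for illustration.

```r
library(dplyr)

# Hypothetical case records and a hospital lookup table
cases <- data.frame(
  case_id = 1:5,
  age     = c(4, 25, 67, 40, 15),
  sex     = c("f", "m", "f", "f", "m"),
  hosp_id = c("A", "B", "A", "C", "B")
)
hospitals <- data.frame(
  hosp_id   = c("A", "B"),
  hosp_name = c("Central", "Port")
)

# Selecting records: conditions separated by a comma ...
adult_women <- cases %>% filter(age >= 18, sex == "f")

# ... or by the AND operator (&) - both give the same rows
adult_women_amp <- cases %>% filter(age >= 18 & sex == "f")

# Combining datasets: no prior sorting is needed before a join
cases_joined <- left_join(cases, hospitals, by = "hosp_id")
```

In this sketch the case from hospital “C” has no match in the lookup table, so left_join() keeps the row and fills hosp_name with NA.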

Descriptive analysis

SAS | R
Get a high-level overview of your dataset using Proc Summary procedure, which provides the variable names and descriptive statistics | Get a high-level overview of your dataset using summary(dataset) or skim(dataset) from the skimr package
Tabulate counts of a variable using proc freq data=Dataset; Tables varname; Run; | See the page on Descriptive tables. Options include table() from base R, and tabyl() from janitor package, among others. Note you will need to specify the dataset and column name as R holds multiple datasets.
Cross-tabulation of two variables in a 2x2 table is done with proc freq data=Dataset; Tables rowvar*colvar; Run; | Again, you can use table(), tabyl() or other options as described in the Descriptive tables page.

Some useful resources:

R for SAS and SPSS Users (2011)

SAS and R, Second Edition (2014)

4.4 Data interoperability

See the Import and export page for details on how the R package rio can import and export files such as STATA .dta files, SAS .xpt and .sas7bdat files, SPSS .por and .sav files, and many others.

5 Suggested packages

Below is a long list of suggested packages for common epidemiological work in R. You can copy this code, run it, and all of these packages will install from CRAN and load for use in the current R session. If a package is already installed, it will be loaded for use only.

You can modify the code with # symbols to exclude any packages you do not want.

Of note:

  • Install the pacman package first before running the below code. You can do this with install.packages("pacman"). In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use in the current R session. You can also load packages that are already installed with library() from base R.
  • In the code below, packages that are included when installing/loading another package are indicated by an indent and hash. For example, ggplot2 is listed under tidyverse.
  • If multiple packages have functions with the same name, masking can occur when the function from the more recently-loaded package takes precedence. Read more in the R basics page. Consider using the package conflicted to manage such conflicts.
  • See the R basics section on packages for more information on pacman and masking.
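
Masking can be demonstrated without installing anything. In the sketch below, a user-defined function stands in for the “more recently-loaded package”: this is our simplification, but the same package::function() syntax resolves conflicts between real packages.

```r
x <- c(1, 2, 3)

# A user-defined function named mean() now masks base R's mean()
mean <- function(x) "this is not the base R mean"
masked_result <- mean(x)

# Naming the package explicitly avoids any ambiguity
explicit_result <- base::mean(x)

# Remove the masking definition; mean() refers to base R again
rm(mean)
restored_result <- mean(x)
```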

To see the versions of R, RStudio, and R packages used during the production of this handbook, see the page on Editorial and technical notes.

5.1 Packages from CRAN

##########################################
# List of useful epidemiology R packages #
##########################################

# This script uses the p_load() function from pacman R package, 
# which installs if package is absent, and loads for use if already installed


# Ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman")


# Packages available from CRAN
##############################
pacman::p_load(
     
     # learning R
     ############
     learnr,   # interactive tutorials in RStudio Tutorial pane
     swirl,    # interactive tutorials in R console
        
     # project and file management
     #############################
     here,     # file paths relative to R project root folder
     rio,      # import/export of many types of data
     openxlsx, # import/export of multi-sheet Excel workbooks 
     
     # package install and management
     ################################
     pacman,   # package install/load
     renv,     # managing versions of packages when working in collaborative groups
     remotes,  # install from github
     
     # General data management
     #########################
     tidyverse,    # includes many packages for tidy data wrangling and presentation
          #dplyr,      # data management
          #tidyr,      # data management
          #ggplot2,    # data visualization
          #stringr,    # work with strings and characters
          #forcats,    # work with factors 
          #lubridate,  # work with dates
          #purrr       # iteration and working with lists
     linelist,     # cleaning linelists
     naniar,       # assessing missing data
     
     # statistics  
     ############
     janitor,      # tables and data cleaning
     gtsummary,    # making descriptive and statistical tables
     rstatix,      # quickly run statistical tests and summaries
     broom,        # tidy up results from regressions
     lmtest,       # likelihood-ratio tests
     easystats,
          # parameters, # alternative to tidy up results from regressions
          # see,        # alternative to visualise forest plots 
     
     # epidemic modeling
     ###################
     epicontacts,  # Analysing transmission networks
     EpiNow2,      # Rt estimation
     EpiEstim,     # Rt estimation
     projections,  # Incidence projections
     incidence2,   # Make epicurves and handle incidence data
     i2extras,     # Extra functions for the incidence2 package
     epitrix,      # Useful epi functions
     distcrete,    # Discrete delay distributions
     
     
     # plots - general
     #################
     #ggplot2,         # included in tidyverse
     cowplot,          # combining plots  
     # patchwork,      # combining plots (alternative)     
     RColorBrewer,     # color scales
     ggnewscale,       # to add additional layers of color schemes

     
     # plots - specific types
     ########################
     DiagrammeR,       # diagrams using DOT language
     incidence2,       # epidemic curves
     gghighlight,      # highlight a subset
     ggrepel,          # smart labels
     plotly,           # interactive graphics
     gganimate,        # animated graphics 

     
     # gis
     ######
     sf,               # to manage spatial data using a Simple Feature format
     tmap,             # to produce simple maps, works for both interactive and static maps
     OpenStreetMap,    # to add OSM basemap in ggplot map
     spdep,            # spatial statistics 
     
     # routine reports
     #################
     rmarkdown,        # produce PDFs, Word Documents, Powerpoints, and HTML files
     reportfactory,    # auto-organization of R Markdown outputs
     officer,          # powerpoints
     
     # dashboards
     ############
     flexdashboard,    # convert an R Markdown script into a dashboard
     shiny,            # interactive web apps
     
     # tables for presentation
     #########################
     knitr,            # R Markdown report generation and html tables
     flextable,        # HTML tables
     #DT,              # HTML tables (alternative)
     #gt,              # HTML tables (alternative)
     #huxtable,        # HTML tables (alternative) 
     
     # phylogenetics
     ###############
     ggtree,           # visualization and annotation of trees (distributed via Bioconductor)
     ape,              # analysis of phylogenetics and evolution
     treeio            # to visualize phylogenetic files (distributed via Bioconductor)
 
)

5.2 Packages from Github

Below are commands to install two packages directly from Github repositories.

  • The development version of epicontacts contains the ability to make transmission trees with a temporal x-axis
  • The epirhandbook package contains all the example data for this handbook and can be used to download the offline version of the handbook.
# Packages to download from Github (not available on CRAN)
##########################################################

# Development version of epicontacts (for transmission chains with a time x-axis)
pacman::p_install_gh("reconhub/epicontacts@timeline")

# The package for this handbook, which includes all the example data  
pacman::p_install_gh("appliedepi/epirhandbook")

6 R projects

An R project enables your work to be bundled in a portable, self-contained folder. Within the project, all the relevant scripts, data files, figures/outputs, and history are stored in sub-folders and importantly - the working directory is the project’s root folder.

6.1 Suggested use

A common, efficient, and trouble-free way to use R is to combine these 3 elements. One discrete work project is hosted within one R project. Each element is described in the sections below.

  1. An R project
    • A self-contained working environment with folders for data, scripts, outputs, etc.
  2. The here package for relative filepaths
    • Filepaths are written relative to the root folder of the R project - see Import and export for more information
  3. The rio package for importing/exporting
    • import() and export() handle many file types, choosing the method by the file extension (e.g. .csv, .xlsx, .rds)

6.2 Creating an R project

To create an R project, select “New Project” from the File menu.

  • If you want to create a new folder for the project, select “New directory” and indicate where you want it to be created.
  • If you want to create the project within an existing folder, click “Existing directory” and indicate the folder.
  • If you want to clone a Github repository, select the third option “Version Control” and then “Git”. See the page on Version control and collaboration with Git and Github for further details.

The R project you create will come in the form of a folder containing a .Rproj file. This file is a shortcut and likely the primary way you will open your project. You can also open a project by selecting “Open Project” from the File menu. Alternatively on the far upper right side of RStudio you will see an R project icon and a drop-down menu of available R projects.

To exit from an R project, either open a new project, or close the project (File - Close Project).

Switch projects

To switch between projects, click the R project icon and drop-down menu at the very top-right of RStudio. You will see options to Close Project, Open Project, and a list of recent projects.

Settings

It is generally advised that you start RStudio each time with a “clean slate” - that is, with your workspace not preserved from your previous session. This will mean that your objects and results will not persist session-to-session (you must re-create them by running your scripts). This is good, because it will force you to write better scripts and avoid errors in the long run.

To set RStudio to have a “clean slate” each time at start-up:

  • Select “Project Options” from the Tools menu.
  • In the “General” tab, set RStudio to not restore .RData into workspace at startup, and to not save workspace to .RData on exit.

Organization

It is common to have subfolders in your project. Consider having folders such as “data”, “scripts”, “figures”, “presentations”. You can add folders in the typical way you would add a new folder on your computer. Alternatively, see the page on Directory interactions to learn how to create new folders with R commands.

Version control

Consider a version control system. It could be something as simple as having dates on the names of scripts (e.g. “transmission_analysis_2020-10-03.R”) and an “archive” folder. Consider also having commented header text at the top of each script with a description, tags, authors, and change log.

A more complicated method would involve using Github or a similar platform for version control. See the page on Version control and collaboration with Git and Github.

One tip is that you can search across an entire project or folder using the “Find in Files” tool (Edit menu). It can search and even replace strings across multiple files.

6.3 Examples

Below are some examples of import/export/saving using here() from within an R project. Read more about using the here package in the Import and export page.

Importing linelist_raw.xlsx from the “data” folder in your R project

linelist <- import(here("data", "linelist_raw.xlsx"))

Exporting the R object linelist as “my_linelist.rds” to the “clean” folder within the “data” folder in your R project.

export(linelist, here("data","clean", "my_linelist.rds"))

Saving the most recently printed plot as “epicurve_2021-02-15.png” within the “epicurves” folder in “outputs” folder in your R project.

ggsave(here("outputs", "epicurves", "epicurve_2021-02-15.png"))

6.4 Resources

RStudio webpage on using R projects

7 Import and export

In this page we describe ways to locate, import, and export files:

  • Use of the rio package to flexibly import() and export() many types of files
  • Use of the here package to locate files relative to an R project root - to prevent complications from file paths that are specific to one computer
  • Specific import scenarios, such as:
    • Specific Excel sheets
    • Messy headers and skipping rows
    • From Google sheets
    • From data posted to websites
    • With APIs
    • Importing the most recent file
  • Manual data entry
  • R-specific file types such as RDS and RData
  • Exporting/saving files and plots

7.1 Overview

When you import a “dataset” into R, you are generally creating a new data frame object in your R environment and defining it as an imported file (e.g. Excel, CSV, TSV, RDS) that is located in your folder directories at a certain file path/address.

You can import/export many types of files, including those created by other statistical programs (SAS, STATA, SPSS). You can also connect to relational databases.

R even has its own data formats:

  • An RDS file (.rds) stores a single R object such as a data frame. These are useful to store cleaned data, as they maintain R column classes. Read more in this section.
  • An RData file (.Rdata) can be used to store multiple objects, or even a complete R workspace. Read more in this section.
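
The difference between the two formats can be sketched with base R functions (rio's import() and export() can also handle .rds files, as shown elsewhere in this handbook). The data frame, the threshold object, and the temporary file locations are invented for this example.

```r
# Hypothetical cleaned data with a Date column and a factor column
cleaned <- data.frame(
  case_id = 1:3,
  onset   = as.Date(c("2021-02-01", "2021-02-03", "2021-02-07")),
  outcome = factor(c("recovered", "death", "recovered"))
)

# RDS: one object per file; you choose the object name when reading it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(cleaned, rds_path)
restored <- readRDS(rds_path)
class(restored$onset)      # column classes are maintained

# RData: several objects in one file, restored under their original names
threshold  <- 5
rdata_path <- tempfile(fileext = ".RData")
save(cleaned, threshold, file = rdata_path)
rm(cleaned, threshold)
load(rdata_path)           # re-creates 'cleaned' and 'threshold'
```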

7.2 The rio package

The R package we recommend is: rio. The name “rio” is an abbreviation of “R I/O” (input/output).

Its functions import() and export() can handle many different file types (e.g. .xlsx, .csv, .rds, .tsv). When you provide a file path to either of these functions (including the file extension like “.csv”), rio will read the extension and use the correct tool to import or export the file.

The alternative to using rio is to use functions from many other packages, each of which is specific to a type of file. For example, read.csv() (base R), read.xlsx() (openxlsx package), and write_csv() (readr package), etc. These alternatives can be difficult to remember, whereas using import() and export() from rio is easy.

rio’s functions import() and export() use the appropriate package and function for a given file, based on its file extension. See the end of this page for a complete table of which packages/functions rio uses in the background. It can also be used to import STATA, SAS, and SPSS files, among dozens of other file types.
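
To give a feel for what “choosing a function based on the file extension” means, here is a rough base R sketch. It is only an illustration of the idea, not rio's actual implementation, and import_by_extension() is a hypothetical helper that covers just three file types.

```r
# Hypothetical helper: pick a reader according to the file extension
import_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = read.csv(path),
    tsv = read.delim(path),
    rds = readRDS(path),
    stop("Unsupported file extension: ", ext)
  )
}

# Try it on a small temporary csv file
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, sex = c("f", "m")), csv_path, row.names = FALSE)
from_csv <- import_by_extension(csv_path)
```

With rio, the single call import(csv_path) plays the role of this helper, across many more file types.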

Import/export of shapefiles requires other packages, as detailed in the page on GIS basics.

7.3 The here package

The package here and its function here() make it easy to tell R where to find and to save your files - in essence, it builds file paths.

Used in conjunction with an R project, here allows you to describe the location of files in your R project in relation to the R project’s root directory (the top-level folder). This is useful when the R project may be shared or accessed by multiple people/computers. It prevents complications due to the unique file paths on different computers (e.g. "C:/Users/Laura/Documents...") by “starting” the file path in a place common to all users (the R project root).

This is how here() works within an R project:

  • When the here package is first loaded within the R project, it locates the root folder of your R project (for example, the folder containing the .Rproj file, or a small file called “.here” created with set_here()) and uses it as a “benchmark” or “anchor”
  • In your scripts, to reference a file in the R project’s sub-folders, you use the function here() to build the file path in relation to that anchor
  • To build the file path, write the names of folders beyond the root, within quotes, separated by commas, finally ending with the file name and file extension as shown below
  • here() file paths can be used for both importing and exporting

For example, below, the function import() is being provided a file path constructed with here().

linelist <- import(here("data", "linelists", "ebola_linelist.xlsx"))

The command here("data", "linelists", "ebola_linelist.xlsx") is actually providing the full file path that is unique to the user’s computer:

"C:/Users/Laura/Documents/my_R_project/data/linelists/ebola_linelist.xlsx"

The beauty is that the R command using here() can be successfully run on any computer accessing the R project.
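
Conceptually, here() joins the project root and the folder names into one path, much like base R's file.path(). The sketch below mimics this with a hard-coded root, which is an assumption for illustration only: the real here() finds the root folder itself, which is exactly what makes it portable.

```r
# Hypothetical root folder on Laura's computer (here() would detect this itself)
project_root <- "C:/Users/Laura/Documents/my_R_project"

# Folder names and file name, in quotes and separated by commas
full_path <- file.path(project_root, "data", "linelists", "ebola_linelist.xlsx")

full_path
```

On a colleague's computer only the root differs, while the part written in the script ("data", "linelists", "ebola_linelist.xlsx") stays the same.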

TIP: If you are unsure where the “.here” root is set to, run the function here() with empty parentheses.

Read more about the here package at this link.

7.4 File paths

When importing or exporting data, you must provide a file path. You can do this one of three ways:

  1. Recommended: provide a “relative” file path with the here package
  2. Provide the “full” / “absolute” file path
  3. Manual file selection

“Relative” file paths

In R, “relative” file paths consist of the file path relative to the root of an R project. They allow for simpler file paths that can work on different computers (e.g. if the R project is on a shared drive or is sent by email). As described above, relative file paths are facilitated by use of the here package.

An example of a relative file path constructed with here() is below. We assume the work is in an R project that contains a sub-folder “data” and within that a subfolder “linelists”, in which there is the .xlsx file of interest.

linelist <- import(here("data", "linelists", "ebola_linelist.xlsx"))

“Absolute” file paths

Absolute or “full” file paths can be provided to functions like import() but they are “fragile” as they are unique to the user’s specific computer and therefore not recommended.

Below is an example of an absolute file path, where in Laura’s computer there is a folder “analysis”, a sub-folder “data” and within that a sub-folder “linelists”, in which there is the .xlsx file of interest.

linelist <- import("C:/Users/Laura/Documents/analysis/data/linelists/ebola_linelist.xlsx")

A few things to note about absolute file paths:

  • Avoid using absolute file paths as they will break if the script is run on a different computer
  • Use forward slashes (/), as in the example above (note: this is NOT the default for Windows file paths)
  • File paths that begin with double slashes (e.g. “//…”) will likely not be recognized by R and will produce an error. Consider moving your work to a “named” or “lettered” drive that begins with a letter (e.g. “J:” or “C:”). See the page on Directory interactions for more details on this issue.

One scenario where absolute file paths may be appropriate is when you want to import a file from a shared drive that has the same full file path for all users.

TIP: To quickly convert all \ to /, highlight the code of interest, use Ctrl+f (in Windows), check the option box for “In selection”, and then use the replace functionality to convert them.
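
The same conversion can also be done in code. The sketch below uses only base R and the example path from above; the raw-string syntax r"(...)" assumes R version 4.0.0 or later.

```r
# a Windows-style path written as a raw string (R >= 4.0.0), so backslashes need no escaping
win_path <- r"(C:\Users\Laura\Documents\analysis\data\linelists\ebola_linelist.xlsx)"

# replace every backslash with a forward slash
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path
```

The converted path can then be supplied to import() like any other absolute file path.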

Select file manually

You can import data manually via one of these methods:

  1. Environment RStudio Pane, click “Import Dataset”, and select the type of data
  2. Click File / Import Dataset / (select the type of data)
  3. To hard-code manual selection, use the base R command file.choose() (leaving the parentheses empty) to trigger appearance of a pop-up window that allows the user to manually select the file from their computer. For example:
# Manual selection of a file. When this command is run, a POP-UP window will appear. 
# The file path selected will be supplied to the import() command.

my_data <- import(file.choose())

TIP: The pop-up window may appear BEHIND your RStudio window.

7.5 Import data

Using import() to import a dataset is simple: provide the path to the file (including the file name and file extension) in quotes. If using here() to build the file path, follow the instructions above. Below are a few examples:

Importing a csv file that is located in your “working directory” or in the R project root folder:

linelist <- import("linelist_cleaned.csv")

Importing the first sheet of an Excel workbook that is located in “data” and “linelists” sub-folders of the R project (the file path built using here()):

linelist <- import(here("data", "linelists", "linelist_cleaned.xlsx"))

Importing a data frame (a .rds file) using an absolute file path:

linelist <- import("C:/Users/Laura/Documents/tuberculosis/data/linelists/linelist_cleaned.rds")

Specific Excel sheets

By default, if you provide an Excel workbook (.xlsx) to import(), the workbook’s first sheet will be imported. If you want to import a specific sheet, provide the sheet name to the which = argument. For example:

my_data <- import("my_excel_file.xlsx", which = "Sheetname")

If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parentheses of the here() function.

# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
linelist_raw <- import(here("data", "linelist.xlsx"), which = "Sheet1")

To export a data frame from R to a specific Excel sheet and have the rest of the Excel workbook remain unchanged, you will have to import, edit, and export with an alternative package catered to this purpose such as openxlsx. See more information in the page on Directory interactions or at this github page.

If your Excel workbook is .xlsb (binary format Excel workbook) you may not be able to import it using rio. Consider re-saving it as .xlsx, or using a package like readxlsb which is built for this purpose.

Missing values

You may want to designate which value(s) in your dataset should be considered as missing. As explained in the page on Missing data, the value in R for missing data is NA, but perhaps the dataset you want to import uses 99, “Missing”, or just empty character space "" instead.

Use the na = argument for import() and provide the value(s) within quotes (even if they are numbers). You can specify multiple values by including them within a vector, using c() as shown below.

Here, the value “99” in the imported dataset is considered missing and converted to NA in R.

linelist <- import(here("data", "my_linelist.xlsx"), na = "99")

Here, any of the values “Missing”, "" (empty cell), or " " (single space) in the imported dataset are converted to NA in R.

linelist <- import(here("data", "my_linelist.csv"), na = c("Missing", "", " "))
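
To see the effect of such an argument on a small file, here is a minimal sketch that uses base R's read.csv() and its na.strings = argument instead of import(). The file contents are made up for illustration.

```r
# write a small example file (hypothetical data) to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("case_id,age,outcome",
             "a1,34,Recover",
             "b2,99,Missing",
             "c3,51,"), tmp)

# values listed in na.strings are converted to NA on import
linelist_demo <- read.csv(tmp, na.strings = c("99", "Missing", ""))

colSums(is.na(linelist_demo))   # count of NA values per column
```

Counting the NA values per column after import is a quick way to check that the missing-value codes were recognized.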

Skip rows

Sometimes, you may want to avoid importing a row of data. You can do this with the argument skip = if using import() from rio on a .xlsx or .csv file. Provide the number of rows you want to skip.

linelist_raw <- import("linelist_raw.xlsx", skip = 1)  # does not import header row

Unfortunately skip = only accepts one integer value, not a range (e.g. “2:10” does not work). To skip import of specific rows that are not consecutive from the top, consider importing multiple times and using bind_rows() from dplyr. See the example below of skipping only row 2.
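
A minimal sketch of this import-twice idea is shown here with base R functions (read.csv() and rbind()) on a made-up file, leaving out only the second data row. The same logic should carry over to import() and bind_rows(), although the argument that limits the number of rows read differs by file type.

```r
# write a small example file (hypothetical data) to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("case_id,age", "1,34", "2,51", "3,27", "4,45"), tmp)

# part 1: the header and the single data row above the unwanted row
rows_before <- read.csv(tmp, nrows = 1)

# part 2: everything below the unwanted row (skip the header and 2 data rows),
# re-using the column names from part 1
rows_after <- read.csv(tmp, skip = 3, header = FALSE, col.names = names(rows_before))

# combine the two parts; the second data row (case_id 2) is left out
linelist_demo <- rbind(rows_before, rows_after)
linelist_demo
```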

Manage a second header row

Sometimes, your data may have a second row, for example if it is a “data dictionary” row as shown below. This situation can be problematic because it can result in all columns being imported as class “character”.

Below is an example of this kind of dataset (with the first row being the data dictionary).

Remove the second header row

To drop the second header row, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (header and second rows)
  3. Bind the correct names onto the reduced dataframe

The exact argument used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). This is because rio is using a different function for the different file types (see table above).

For Excel files: (col_names =)

# import first time; store the column names
linelist_raw_names <- import("linelist_raw.xlsx") %>% names()  # save true column names

# import second time; skip the first two rows (header and second row), and assign column names to argument col_names =
linelist_raw <- import("linelist_raw.xlsx",
                       skip = 2,
                       col_names = linelist_raw_names
                       ) 

For CSV files: (col.names =)

# import first time; store column names
linelist_raw_names <- import("linelist_raw.csv") %>% names() # save true column names

# note argument for csv files is 'col.names = '
linelist_raw <- import("linelist_raw.csv",
                       skip = 2,
                       col.names = linelist_raw_names
                       ) 

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

Make a data dictionary

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it. This tip is adapted from this post.

dict <- linelist_2headers %>%             # begin: linelist with dictionary as first row
  head(1) %>%                             # keep only column names and first dictionary row                
  pivot_longer(cols = everything(),       # pivot all columns to long format
               names_to = "Column",       # assign new column names
               values_to = "Description")

Combine the two header rows

In some cases when your raw dataset has two header rows (or more specifically, the 2nd row of data is a secondary header), you may want to “combine” them or add the values in the second header row into the first header row.

The command below will define the data frame’s column names as the combination (pasting together) of the first (true) headers with the value immediately underneath (in the first row).

names(my_data) <- paste(names(my_data), my_data[1, ], sep = "_")
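
As a small worked illustration, the sketch below applies the same command to a made-up data frame and then drops the secondary header row, which is assumed to be no longer needed once its values are in the column names.

```r
# made-up data: the first row holds the secondary header
my_data <- data.frame(
  case  = c("id", "A1", "A2"),
  onset = c("date", "2020-10-01", "2020-10-02"),
  stringsAsFactors = FALSE
)

# paste the true headers together with the values in the first row
names(my_data) <- paste(names(my_data), my_data[1, ], sep = "_")

# remove the secondary header row and reset the row numbers
my_data <- my_data[-1, ]
rownames(my_data) <- NULL

names(my_data)
```

Note that the remaining columns are still class character and may need converting afterwards.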

Google sheets

You can import data from an online Google spreadsheet with the googlesheets4 package and by authenticating your access to the spreadsheet.

pacman::p_load("googlesheets4")

Below, a demo Google sheet is imported and saved. This command may prompt you to confirm authentication of your Google account. Follow prompts and pop-ups in your internet browser to grant Tidyverse API packages permissions to edit, create, and delete your spreadsheets in Google Drive.

The sheet below is “viewable for anyone with the link” and you can try to import it.

Gsheets_demo <- read_sheet("https://docs.google.com/spreadsheets/d/1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY/edit#gid=0")

The sheet can also be imported using only the sheet ID, a shorter part of the URL:

Gsheets_demo <- read_sheet("1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY")

The googlesheets4 package also has functions for creating and writing Google sheets, for example gs4_create() and sheet_write(). Another package, googledrive, offers useful functions for managing files in Google Drive, such as moving and deleting them.

Here are some other helpful online tutorials:
basic Google sheets importing tutorial
more detailed tutorial
interaction between the googlesheets4 and tidyverse

7.6 Multiple files - import, export, split, combine

See the page on Iteration, loops, and lists for examples of how to import and combine multiple files, or multiple Excel workbook files. That page also has examples on how to split a data frame into parts and export each one separately, or as named sheets in an Excel workbook.

7.7 Import from Github

Importing data directly from Github into R can be very easy or can require a few steps - depending on the file type. Below are some approaches:

CSV files

It can be easy to import a .csv file directly from Github into R with an R command.

  1. Go to the Github repo, locate the file of interest, and click on it
  2. Click on the “Raw” button (you will then see the “raw” csv data, as shown below)
  3. Copy the URL (web address)
  4. Place the URL in quotes within the import() R command
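
The “Raw” address typically differs from the address of the file's page in a predictable way. The sketch below builds it with base R from a hypothetical page address (the user, repository, and file names are placeholders, not a real dataset).

```r
# hypothetical address of a csv file as shown in the browser on Github
page_url <- "https://github.com/example-user/example-repo/blob/main/data/linelist.csv"

# the "Raw" address typically uses a different host and drops the "/blob/" part
raw_url <- sub("https://github.com/", "https://raw.githubusercontent.com/", page_url, fixed = TRUE)
raw_url <- sub("/blob/", "/", raw_url, fixed = TRUE)
raw_url

# linelist <- rio::import(raw_url)   # uncomment to import (needs internet and a real URL)
```

If in doubt, copy the address shown in the browser after clicking “Raw” rather than constructing it.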

XLSX files

You may not be able to view the “Raw” data for some files (e.g. .xlsx, .rds, .nwk, .shp)

  1. Go to the Github repo, locate the file of interest, and click on it
  2. Click the “Download” button, as shown below
  3. Save the file on your computer, and import it into R

Shapefiles

Shapefiles have many sub-component files, each with a different file extension. One file will have the “.shp” extension, but others may have “.dbf”, “.prj”, etc. To download a shapefile from Github, you will need to download each of the sub-component files individually, and save them in the same folder on your computer. In Github, click on each file individually and download them by clicking on the “Download” button.

Once saved to your computer you can import the shapefile as shown in the GIS basics page using st_read() from the sf package. You only need to provide the filepath and name of the “.shp” file - as long as the other related files are within the same folder on your computer.
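
Before importing, it can help to confirm that the component files are all in the same folder. This is a minimal base R sketch; the folder location is a placeholder, and the list of extensions covers only the three components a shapefile needs at minimum (other files such as “.prj” are also worth downloading).

```r
# hypothetical folder and shapefile name; replace with your own
shp_folder <- file.path(tempdir(), "shp")
shp_name   <- "sle_adm3"

# the component files that a shapefile needs at minimum
needed <- file.path(shp_folder, paste0(shp_name, c(".shp", ".shx", ".dbf")))

# TRUE/FALSE for each file; any FALSE means a file still has to be downloaded
present <- file.exists(needed)
setNames(present, basename(needed))

# import only once everything is in place (requires the sf package)
# if (all(present)) my_shapes <- sf::st_read(file.path(shp_folder, paste0(shp_name, ".shp")))
```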

Below, you can see how the shapefile “sle_adm3” consists of many files - each of which must be downloaded from Github.

7.8 Manual data entry

Entry by rows

Use the tribble function from the tibble package from the tidyverse (online tibble reference).

Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.). You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. Spaces do not matter between values, but each row is represented by a new line of code. For example:

# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
  )

And now we display the new dataset:

Entry by columns

Since a data frame consists of vectors (vertical columns), the base approach to manual dataframe creation in R expects you to define each column and then bind them together. This can be counter-intuitive in epidemiology, as we usually think about our data in rows (as above).

# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death     <- c(1, 0, 1, 0)

CAUTION: All vectors must be the same length (same number of values).

The vectors can then be bound together using the function data.frame():

# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)

And now we display the new dataset:

Pasting from clipboard

If you copy data from elsewhere and have it on your clipboard, you can try one of the two ways below:

From the clipr package, you can use read_clip_tbl() to import as a data frame, or just read_clip() to import as a character vector. In both cases, leave the parentheses empty.

linelist <- clipr::read_clip_tbl()  # imports current clipboard as data frame
linelist <- clipr::read_clip()      # imports as character vector

You can also easily export to your system’s clipboard with clipr. See the section below on Export.

Alternatively, you can use the read.table() function from base R with file = "clipboard" to import as a data frame:

df_from_clipboard <- read.table(
  file = "clipboard",  # specify this as "clipboard"
  sep = "\t",          # separator could be tab, or commas, etc.
  header = TRUE)       # if there is a header row

7.9 Import most recent file

Often you may receive daily updates to your datasets. In this case you will want to write code that imports the most recent file. Below we present two ways to approach this:

  • Selecting the file based on the date in the file name
  • Selecting the file based on file metadata (last modification)

Dates in file name

This approach depends on three premises:

  1. You trust the dates in the file names
  2. The dates are numeric and appear in generally the same format (e.g. year then month then day)
  3. There are no other numbers in the file name

We will explain each step, and then show them combined at the end.

First, use dir() from base R to extract just the file names for each file in the folder of interest. See the page on Directory interactions for more details about dir(). In this example, the folder of interest is the folder “linelists” within the folder “example” within “data” within the R project.

linelist_filenames <- dir(here("data", "example", "linelists")) # get file names from folder
linelist_filenames                                              # print
## [1] "20201007linelist.csv"          "case_linelist_2020-10-02.csv"  "case_linelist_2020-10-03.csv"  "case_linelist_2020-10-04.csv"  "case_linelist_2020-10-05.csv" 
## [6] "case_linelist_2020-10-08.xlsx" "case_linelist20201006.csv"

Once you have this vector of names, you can extract the dates from them by applying str_extract() from stringr using this regular expression. It extracts any numbers in the file name (including any other characters in the middle such as dashes or slashes). You can read more about stringr in the Strings and characters page.

linelist_dates_raw <- stringr::str_extract(linelist_filenames, "[0-9].*[0-9]") # extract numbers and any characters in between
linelist_dates_raw  # print
## [1] "20201007"   "2020-10-02" "2020-10-03" "2020-10-04" "2020-10-05" "2020-10-08" "20201006"

Assuming the dates are written in generally the same date format (e.g. Year then Month then Day) and the years are 4-digits, you can use lubridate’s flexible conversion functions (ymd(), dmy(), or mdy()) to convert them to dates. For these functions, the dashes, spaces, or slashes do not matter, only the order of the numbers. Read more in the Working with dates page.

linelist_dates_clean <- lubridate::ymd(linelist_dates_raw)
linelist_dates_clean
## [1] "2020-10-07" "2020-10-02" "2020-10-03" "2020-10-04" "2020-10-05" "2020-10-08" "2020-10-06"

The base R function which.max() can then be used to return the index position (e.g. 1st, 2nd, 3rd, …) of the maximum date value. The latest file is correctly identified as the 6th file - “case_linelist_2020-10-08.xlsx”.

index_latest_file <- which.max(linelist_dates_clean)
index_latest_file
## [1] 6
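
For readers who want to try these steps without the example folder, here is a base R sketch of the same logic, applied to the file names printed above. It swaps in regmatches(), gsub(), and as.Date() for the stringr and lubridate functions, and assumes every file name contains a year-month-day date.

```r
# file names copied from the example folder above
linelist_filenames <- c("20201007linelist.csv", "case_linelist_2020-10-02.csv",
                        "case_linelist_2020-10-03.csv", "case_linelist_2020-10-04.csv",
                        "case_linelist_2020-10-05.csv", "case_linelist_2020-10-08.xlsx",
                        "case_linelist20201006.csv")

# extract from the first digit to the last digit in each name
dates_raw <- regmatches(linelist_filenames, regexpr("[0-9].*[0-9]", linelist_filenames))

# drop dashes and other separators, then read as year-month-day
dates_clean <- as.Date(gsub("[^0-9]", "", dates_raw), format = "%Y%m%d")

# position and name of the file with the latest date
index_latest <- which.max(dates_clean)
linelist_filenames[index_latest]
```

Unlike str_extract(), regmatches() silently drops names that contain no match, so this version is only safe when every file name includes a date.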

If we condense all these commands, the complete code could look like below. Note that the . in the last line is a placeholder for the piped object at that point in the pipe sequence. At that point the value is simply the number 6. This is placed in double brackets to extract the 6th element of the vector of file names produced by dir().

# load packages
pacman::p_load(
  tidyverse,         # data management
  stringr,           # work with strings/characters
  lubridate,         # work with dates
  rio,               # import / export
  here,              # relative file paths
  fs)                # directory interactions

# extract the file name of latest file
latest_file <- dir(here("data", "example", "linelists")) %>%  # file names from "linelists" sub-folder          
  str_extract("[0-9].*[0-9]") %>%                  # pull out dates (numbers)
  ymd() %>%                                        # convert numbers to dates (assuming year-month-day format)
  which.max() %>%                                  # get index of max date (latest file)
  dir(here("data", "example", "linelists"))[[.]]              # return the filename of latest linelist

latest_file  # print name of latest file
## [1] "case_linelist_2020-10-08.xlsx"

You can now use this name to finish the relative file path, with here():

here("data", "example", "linelists", latest_file) 

And you can now import the latest file:

# import
import(here("data", "example", "linelists", latest_file)) # import 

Use the file info

If your files do not have dates in their names (or you do not trust those dates), you can try to extract the last modification date from the file metadata. Use functions from the package fs to examine the metadata information for each file, which includes the last modification time and the file path.

Below, we provide the folder of interest to fs’s dir_info(). In this case, the folder of interest is in the R project in the folder “data”, the sub-folder “example”, and its sub-folder “linelists”. The result is a data frame with one line per file and columns for modification_time, path, etc. You can see a visual example of this in the page on Directory interactions.

We can sort this data frame of files by the column modification_time, and then keep only the top/latest row (file) with base R’s head(). Then we can extract the file path of this latest file only with the dplyr function pull() on the column path. Finally we can pass this file path to import(). The imported file is saved as latest_file.

latest_file <- dir_info(here("data", "example", "linelists")) %>%  # collect file info on all files in directory
  arrange(desc(modification_time)) %>%      # sort by modification time
  head(1) %>%                               # keep only the top (latest) file
  pull(path) %>%                            # extract only the file path
  import()                                  # import the file
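
If the fs package is not available, a similar result can be sketched with base R's file.info(), which also reports the modification time. The folder and file names below are made up, and the files are created only so that the example can be run.

```r
# create two small example files in a temporary folder (hypothetical names)
folder <- file.path(tempdir(), "linelists_demo")
dir.create(folder, showWarnings = FALSE)
old_file <- file.path(folder, "linelist_a.csv")
new_file <- file.path(folder, "linelist_b.csv")
writeLines(c("case_id", "1"), old_file)
writeLines(c("case_id", "2"), new_file)

# make the first file look one day older
Sys.setFileTime(old_file, Sys.time() - 24 * 60 * 60)

# modification time of every file in the folder, then the path of the newest
paths <- list.files(folder, full.names = TRUE)
latest_path <- paths[which.max(as.numeric(file.info(paths)$mtime))]
basename(latest_path)
```

The resulting path can be passed to import() in the same way as above.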

7.10 APIs

An “Application Programming Interface” (API) can be used to directly request data from a website. APIs are a set of rules that allow one software application to interact with another. The client (you) sends a “request” and receives a “response” containing content. The R packages httr and jsonlite can facilitate this process.

Each API-enabled website will have its own documentation and specifics to become familiar with. Some sites are publicly available and can be accessed by anyone. Others, such as platforms with user IDs and credentials, require authentication to access their data.

Needless to say, it is necessary to have an internet connection to import data via API. We will briefly give examples of use of APIs to import data, and link you to further resources.

Note: recall that data may be posted on a website without an API, which may be easier to retrieve. For example, a posted CSV file may be accessible simply by providing the site URL to import() as described in the section on importing from Github.

HTTP request

The API exchange is most commonly done through an HTTP request. HTTP is Hypertext Transfer Protocol, and is the underlying format of a request/response between a client and a server. The exact input and output may vary depending on the type of API but the process is the same - a “Request” (often HTTP Request) from the user, often containing a query, followed by a “Response”, containing status information about the request and possibly the requested content.

Here are a few components of an HTTP request:

  • The URL of the API endpoint
  • The “Method” (or “Verb”)
  • Headers
  • Body

The HTTP request “method” is the action you want to perform. The two most common HTTP methods are GET and POST but others could include PUT, DELETE, PATCH, etc. When importing data into R it is most likely that you will use GET.

After your request, your computer will receive a “response” in a format similar to what you sent, including URL, HTTP status (Status 200 is what you want!), file type, size, and the desired content. You will then need to parse this response and turn it into a workable data frame within your R environment.

Packages

The httr package works well for handling HTTP requests in R. It requires little prior knowledge of Web APIs and can be used by people less familiar with software development terminology. In addition, if the HTTP response is .json, you can use jsonlite to parse the response.

# load packages
pacman::p_load(httr, jsonlite, tidyverse)
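
To get a feel for the parsing step without sending a request, the sketch below parses a small, made-up JSON text with jsonlite. The field names are invented for illustration and do not come from any particular API.

```r
library(jsonlite)

# a small, made-up JSON response in the style an API might return
json_text <- '{
  "establishments": [
    {"name": "Cafe A", "rating": 5, "geocode": {"latitude": 53.4, "longitude": -2.3}},
    {"name": "Cafe B", "rating": 3, "geocode": {"latitude": 53.5, "longitude": -2.4}}
  ]
}'

# parse the text; flatten = TRUE turns nested fields into ordinary columns
parsed <- fromJSON(json_text, flatten = TRUE)

# the element "establishments" is a data frame with one row per record
outlets <- parsed$establishments
str(outlets)
```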

Publicly-available data

Below is an example of an HTTP request, borrowed from a tutorial from the Trafford Data Lab. This site has several other learning resources and API exercises.

Scenario: We want to import a list of fast food outlets in the city of Trafford, UK. The data can be accessed from the API of the Food Standards Agency, which provides food hygiene rating data for the United Kingdom.

Here are the parameters for our request:

The R code would be as follows:

# prepare the request
path <- "http://api.ratings.food.gov.uk/Establishments"
request <- GET(url = path,
             query = list(
               localAuthorityId = 188,
               BusinessTypeId = 7844,
               pageNumber = 1,
               pageSize = 5000),
             add_headers("x-api-version" = "2"))

# check for any server error ("200" is good!)
request$status_code

# parse the response (the request was already sent by GET() above) and convert to a data frame
response <- content(request, as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE) %>%
  pluck("establishments") %>%
  as_tibble()

You can now clean and use the response data frame, which contains one row per fast food facility.

Authentication required

Some APIs require authentication - for you to prove who you are, so you can access restricted data. To import these data, you may need to first use a POST method to provide a username, password, or code. This will return an access token, that can be used for subsequent GET method requests to retrieve the desired data.

Below is an example of querying data from Go.Data, which is an outbreak investigation tool. Go.Data uses an API for all interactions between the web front-end and smartphone applications used for data collection. Go.Data is used throughout the world. Because outbreak data are sensitive and you should only be able to access data for your outbreak, authentication is required.

Below is some sample R code using httr and jsonlite for connecting to the Go.Data API to import data on contact follow-up from your outbreak.

# set credentials for authorization
url <- "https://godatasampleURL.int/"           # valid Go.Data instance url
username <- "username"                          # valid Go.Data username 
password <- "password"                          # valid Go.Data password 
outbreak_id <- "xxxxxx-xxxx-xxxx-xxxx-xxxxxxx"  # valid Go.Data outbreak ID

# get access token
url_request <- paste0(url,"api/oauth/token?access_token=123") # define base URL request

# prepare request
response <- POST(
  url = url_request,  
  body = list(
    username = username,    # use saved username/password from above to authorize                               
    password = password),                                       
  encode = "json")

# execute request and parse response
content <-
  content(response, as = "text") %>%
  fromJSON(flatten = TRUE) %>%          # flatten nested JSON
  glimpse()

# Save access token from response
access_token <- content$access_token    # save access token to allow subsequent API calls below

# import outbreak contacts
# Use the access token 
response_contacts <- GET(
  paste0(url,"api/outbreaks/",outbreak_id,"/contacts"),          # GET request
  add_headers(
    Authorization = paste("Bearer", access_token, sep = " ")))

json_contacts <- content(response_contacts, as = "text")         # convert to text JSON

contacts <- as_tibble(fromJSON(json_contacts, flatten = TRUE))   # flatten JSON to tibble

CAUTION: If you are importing large amounts of data from an API requiring authentication, it may time-out. To avoid this, retrieve access_token again before each API GET request and try using filters or limits in the query.

TIP: The fromJSON() function in the jsonlite package does not fully un-nest the first time it is executed, so you will likely still have list items in your resulting tibble. You will need to further un-nest certain variables, depending on how nested your .json is. For more information, see the documentation for the jsonlite package, such as the flatten() function.

For more details, view the documentation on LoopBack Explorer, the Contact Tracing page, or the API tips on the Go.Data Github repository.

You can read more about the httr package here

This section was also informed by this tutorial and this tutorial.

7.11 Export

With rio package

With rio, you can use the export() function in a very similar way to import(). First give the name of the R object you want to save (e.g. linelist) and then in quotes put the file path where you want to save the file, including the desired file name and file extension. For example:

This saves the data frame linelist as an Excel workbook to the working directory/R project root folder:

export(linelist, "my_linelist.xlsx") # will save to working directory

You could save the same data frame as a csv file by changing the extension. For example, we also save it to a file path constructed with here():

export(linelist, here("data", "clean", "my_linelist.csv"))
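
A quick way to confirm that an export worked is to check that the file exists and to import it again. This sketch uses a small made-up data frame and a temporary folder rather than the project folders above.

```r
library(rio)

# a small made-up data frame
linelist_demo <- data.frame(case_id = c("a1", "b2", "c3"), age = c(34, 51, 27))

# export to a temporary folder
out_file <- file.path(tempdir(), "my_linelist_demo.csv")
export(linelist_demo, out_file)

file.exists(out_file)      # confirm the file was written

# import it again and compare the dimensions with the original
check <- import(out_file)
dim(check)
```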

To clipboard

To export a data frame to your computer’s “clipboard” (to then paste into another software like Excel, Google Spreadsheets, etc.) you can use write_clip() from the clipr package.

# export the linelist data frame to your system's clipboard
clipr::write_clip(linelist)

7.12 RDS files

Along with .csv, .xlsx, etc, you can also export/save R data frames as .rds files. This is a file format specific to R, and is very useful if you know you will work with the exported data again in R.

The classes of columns are stored, so you do not have to do cleaning again when it is imported (with an Excel or even a CSV file this can be a headache!). It is also a smaller file, which is useful for export and import if your dataset is large.

For example, if you work in an Epidemiology team and need to send files to a GIS team for mapping, and they use R as well, just send them the .rds file! Then all the column classes are retained and they have less work to do.

export(linelist, here("data", "clean", "my_linelist.rds"))
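
The difference in how column classes survive can be seen in a small round trip. The sketch below uses the base R functions saveRDS(), readRDS(), write.csv(), and read.csv() on made-up data, as a stand-in for export() and import().

```r
# a small made-up data frame with a date column and a factor column
linelist_demo <- data.frame(
  case_id = c("a1", "b2"),
  onset   = as.Date(c("2020-10-01", "2020-10-03")),
  outcome = factor(c("Recover", "Death"), levels = c("Recover", "Death"))
)

# save the same data as .rds and as .csv
rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")
saveRDS(linelist_demo, rds_file)
write.csv(linelist_demo, csv_file, row.names = FALSE)

# read both back in
from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

sapply(from_rds, class)   # date and factor classes are retained
sapply(from_csv, class)   # date and factor columns need converting again
```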

7.13 Rdata files and lists

.Rdata files can store multiple R objects - for example multiple data frames, model results, lists, etc. This can be very useful to consolidate or share a lot of your data for a given project.

In the below example, multiple R objects are stored within the exported file “my_objects.Rdata”:

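# supply the objects as one named list, followed by the file path
rio::export(list(my_list = my_list, my_dataframe = my_dataframe, my_vector = my_vector), "my_objects.Rdata")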

Note: if you are trying to import a list, use import_list() from rio to import it with the complete original structure and contents.

rio::import_list("my_list.Rdata")
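
The same kind of file can also be written and read with the base R functions save() and load(). This is a minimal sketch with made-up objects; loading into a separate environment is one optional way to avoid overwriting objects that already exist in your session.

```r
# made-up objects to store together
my_dataframe <- data.frame(case_id = c("a1", "b2"), age = c(34, 51))
my_vector    <- c(1, 2, 3)

# save both objects into one file
rdata_file <- tempfile(fileext = ".Rdata")
save(my_dataframe, my_vector, file = rdata_file)

# load into a separate environment so that existing objects are not overwritten
restored <- new.env()
loaded_names <- load(rdata_file, envir = restored)
loaded_names   # names of the objects that were stored in the file
```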

7.14 Saving plots

Instructions on how to save plots, such as those created by ggplot(), are discussed in depth in the ggplot basics page.

In brief, run ggsave("my_plot_filepath_and_name.png") after printing your plot. You can either provide a saved plot object to the plot = argument, or only specify the destination file path (with file extension) to save the most recently-displayed plot. You can also control the width =, height =, units =, and dpi =.
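
As a brief sketch of these arguments together, the example below saves a small made-up bar plot to a temporary folder. The plot, file name, and size values are illustrative choices only.

```r
library(ggplot2)

# a small made-up plot, saved as an object
demo_plot <- ggplot(data.frame(day = 1:5, cases = c(2, 5, 9, 4, 1)),
                    aes(x = day, y = cases)) +
  geom_col()

# save the plot object, controlling size and resolution
plot_file <- file.path(tempdir(), "epicurve_demo.png")
ggsave(plot_file, plot = demo_plot, width = 15, height = 10, units = "cm", dpi = 300)

file.exists(plot_file)   # confirm the file was written
```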

How to save a network graph, such as a transmission tree, is addressed in the page on Transmission chains.

7.15 Resources

The R Data Import/Export Manual
R 4 Data Science chapter on data import
ggsave() documentation

Below is a table, taken from the rio online vignette. For each type of data it shows: the expected file extension, the package rio uses to import or export the data, and whether this functionality is included in the default installed version of rio.

Format | Typical Extension | Import Package | Export Package | Installed by Default
Comma-separated data | .csv | data.table fread() | data.table | Yes
Pipe-separated data | .psv | data.table fread() | data.table | Yes
Tab-separated data | .tsv | data.table fread() | data.table | Yes
SAS | .sas7bdat | haven | haven | Yes
SPSS | .sav | haven | haven | Yes
Stata | .dta | haven | haven | Yes
SAS XPORT | .xpt | haven | haven |
SPSS Portable | .por | haven | | Yes
Excel | .xls | readxl | | Yes
Excel | .xlsx | readxl | openxlsx | Yes
R syntax | .R | base | base | Yes
Saved R objects | .RData, .rda | base | base | Yes
Serialized R objects | .rds | base | base | Yes
Epiinfo | .rec | foreign | | Yes
Minitab | .mtp | foreign | | Yes
Systat | .syd | foreign | | Yes
“XBASE” database files | .dbf | foreign | foreign |
Weka Attribute-Relation File Format | .arff | foreign | foreign | Yes
Data Interchange Format | .dif | utils | | Yes
Fortran data | no recognized extension | utils | | Yes
Fixed-width format data | .fwf | utils | utils | Yes
gzip comma-separated data | .csv.gz | utils | utils | Yes
CSVY (CSV + YAML metadata header) | .csvy | csvy | csvy | No
EViews | .wf1 | hexView | | No
Feather R/Python interchange format | .feather | feather | feather | No
Fast Storage | .fst | fst | fst | No
JSON | .json | jsonlite | jsonlite | No
Matlab | .mat | rmatio | rmatio | No
OpenDocument Spreadsheet | .ods | readODS | readODS | No
HTML Tables | .html | xml2 | xml2 | No
Shallow XML documents | .xml | xml2 | xml2 | No
YAML | .yml | yaml | yaml | No
Clipboard | default is tsv | clipr | clipr | No

Data Management

8 Cleaning data and core functions

This page demonstrates common steps used in the process of “cleaning” a dataset, and also explains the use of many essential R data management functions.

To demonstrate data cleaning, this page begins by importing a raw case linelist dataset, and proceeds step-by-step through the cleaning process. In the R code, this manifests as a “pipe” chain, which references the “pipe” operator %>% that passes a dataset from one operation to the next.

Core functions

This handbook emphasizes use of the functions from the tidyverse family of R packages. The essential R functions demonstrated in this page are listed below.

Many of these functions belong to the dplyr R package, which provides “verb” functions to solve data manipulation challenges (the name is a reference to a “data frame-plier”). dplyr is part of the tidyverse family of R packages (which also includes ggplot2, tidyr, stringr, tibble, purrr, magrittr, and forcats among others).

Function | Utility | Package
%>% | “pipe” (pass) data from one function to the next | magrittr
mutate() | create, transform, and re-define columns | dplyr
select() | keep, remove, select, or re-name columns | dplyr
rename() | rename columns | dplyr
clean_names() | standardize the syntax of column names | janitor
as.character(), as.numeric(), as.Date(), etc. | convert the class of a column | base R
across() | transform multiple columns at one time | dplyr
tidyselect functions | use logic to select columns | tidyselect
filter() | keep certain rows | dplyr
distinct() | de-duplicate rows | dplyr
rowwise() | operations by/within each row | dplyr
add_row() | add rows manually | tibble
arrange() | sort rows | dplyr
recode() | re-code values in a column | dplyr
case_when() | re-code values in a column using more complex logical criteria | dplyr
replace_na(), na_if(), coalesce() | special functions for re-coding | tidyr and dplyr
age_categories() and cut() | create categorical groups from a numeric column | epikit and base R
clean_variable_spelling() | re-code/clean values using a data dictionary | linelist
which() | apply logical criteria; return indices | base R

If you want to see how these functions compare to Stata or SAS commands, see the page on Transition to R.

You may encounter an alternative data management framework from the data.table R package with operators like := and frequent use of brackets [ ]. This approach and syntax is briefly explained in the Data Table page.

Nomenclature

In this handbook, we generally reference “columns” and “rows” instead of “variables” and “observations”. As explained in this primer on “tidy data”, most epidemiological statistical datasets consist structurally of rows, columns, and values.

Variables contain the values that measure the same underlying attribute (like age group, outcome, or date of onset). Observations contain all values measured on the same unit (e.g. a person, site, or lab sample). These concepts are more abstract, and so can be harder to define tangibly than rows and columns.

In “tidy” datasets, each column is a variable, each row is an observation, and each cell is a single value. However some datasets you encounter will not fit this mold - a “wide” format dataset may have a variable split across several columns (see an example in the Pivoting data page). Likewise, observations could be split across several rows.

Most of this handbook is about managing and transforming data, so referring to the concrete data structures of rows and columns is more relevant than the more abstract observations and variables. Exceptions occur primarily in pages on data analysis, where you will see more references to variables and observations.

8.1 Cleaning pipeline

This page proceeds through typical cleaning steps, adding them sequentially to a cleaning pipe chain.

In epidemiological analysis and data processing, cleaning steps are often performed sequentially, linked together. In R, this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another.

Such chains utilize dplyr “verb” functions and the magrittr pipe operator %>%. This pipe begins with the “raw” data (“linelist_raw.xlsx”) and ends with a “clean” R data frame (linelist) that can be used, saved, exported, etc.

In a cleaning pipeline the order of the steps is important. Cleaning steps might include:

  • Importing of data
  • Column names cleaned or changed
  • De-duplication
  • Column creation and transformation (e.g. re-coding or standardising values)
  • Rows filtered or added

8.2 Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.
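For comparison, here is a rough base R sketch of what p_load() does for a single package: install it only if it is missing, then load it with library(). We use dplyr as the example package; this is an illustration of the idea, not the internal code of pacman.

```r
# install the package only if it is not already installed
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# load the installed package for use in this R session
library(dplyr)
```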

pacman::p_load(
  rio,        # importing data  
  here,       # relative file pathways  
  janitor,    # data cleaning and tables
  lubridate,  # working with dates
  epikit,     # age_categories() function
  tidyverse   # data management and visualization
)

8.3 Import data

Import

Here we import the “raw” case linelist Excel file using the import() function from the package rio. The rio package flexibly handles many types of files (e.g. .xlsx, .csv, .tsv, .rds). See the page on Import and export for more information and tips on unusual situations (e.g. skipping rows, setting missing values, importing Google sheets, etc.).

If you want to follow along, click to download the “raw” linelist (as .xlsx file).

If your dataset is large and takes a long time to import, it can be useful to have the import command be separate from the pipe chain and the “raw” saved as a distinct file. This also allows easy comparison between the original and cleaned versions.

Below we import the raw Excel file and save it as the data frame linelist_raw. We assume the file is located in your working directory or R project root, and so no sub-folders are specified in the file path.

linelist_raw <- import("linelist_raw.xlsx")

You can view the first 50 rows of the data frame below. Note: the base R function head() allows you to view just the first n rows in the R console, for example head(linelist_raw, n = 10).
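A short sketch of head() is shown here on a made-up data frame, so that it runs without the Excel file. The object demo_raw is hypothetical and simply stands in for linelist_raw.

```r
# hypothetical mini dataset standing in for linelist_raw
demo_raw <- data.frame(case_id = sprintf("id%02d", 1:20),
                       age     = 21:40)

head(demo_raw)                    # prints the first 6 rows (the default)
first3 <- head(demo_raw, n = 3)   # keeps only the first 3 rows
first3
```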

Review

You can use the function skim() from the package skimr to get an overview of the entire dataframe (see page on Descriptive tables for more info). Columns are summarised by class/type, such as character or numeric. Note: “POSIXct” is a type of raw date class (see Working with dates).

skimr::skim(linelist_raw)
Data summary
Name linelist_raw
Number of rows 6611
Number of columns 28
_______________________
Column type frequency:
character 17
numeric 8
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case_id 137 0.98 6 6 0 5888 0
date onset 293 0.96 10 10 0 580 0
outcome 1500 0.77 5 7 0 2 0
gender 324 0.95 1 1 0 2 0
hospital 1512 0.77 5 36 0 13 0
infector 2323 0.65 6 6 0 2697 0
source 2323 0.65 5 7 0 2 0
age 107 0.98 1 2 0 75 0
age_unit 7 1.00 5 6 0 2 0
fever 258 0.96 2 3 0 2 0
chills 258 0.96 2 3 0 2 0
cough 258 0.96 2 3 0 2 0
aches 258 0.96 2 3 0 2 0
vomit 258 0.96 2 3 0 2 0
time_admission 844 0.87 5 5 0 1091 0
merged_header 0 1.00 1 1 0 1 0
…28 0 1.00 1 1 0 1 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
generation 7 1.00 16.60 5.71 0.00 13.00 16.00 20.00 37.00
lon 7 1.00 -13.23 0.02 -13.27 -13.25 -13.23 -13.22 -13.21
lat 7 1.00 8.47 0.01 8.45 8.46 8.47 8.48 8.49
row_num 0 1.00 3240.91 1857.83 1.00 1647.50 3241.00 4836.50 6481.00
wt_kg 7 1.00 52.69 18.59 -11.00 41.00 54.00 66.00 111.00
ht_cm 7 1.00 125.25 49.57 4.00 91.00 130.00 159.00 295.00
ct_blood 7 1.00 21.26 1.67 16.00 20.00 22.00 22.00 26.00
temp 158 0.98 38.60 0.95 35.20 38.30 38.80 39.20 40.80

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
infection date 2322 0.65 2012-04-09 2015-04-27 2014-10-04 538
hosp date 7 1.00 2012-04-20 2015-04-30 2014-10-15 570
date_of_outcome 1068 0.84 2012-05-14 2015-06-04 2014-10-26 575

8.4 Column names

In R, column names are the “header” or “top” value of a column. They are used to refer to columns in the code, and serve as a default label in figures.

Other statistical software such as SAS and Stata use “labels” that co-exist as longer printed versions of the shorter column names. While R does offer the possibility of adding column labels to the data, this is not emphasized in most practice. To make column names “printer-friendly” for figures, one typically adjusts their display within the plotting commands that create the outputs (e.g. axis or legend titles of a plot, or column headers in a printed table - see the scales section of the ggplot tips page and Tables for presentation pages). If you want to assign column labels in the data, read more online here and here.

Because R column names are used very often, they must have “clean” syntax. We suggest the following:

  • Short names
  • No spaces (replace with underscores _ )
  • No unusual characters (&, #, <, >, …)
  • Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)

The column names of linelist_raw are printed below using names() from base R. We can see that initially:

  • Some names contain spaces (e.g. infection date)
  • Different naming patterns are used for dates (date onset vs. infection date)
  • There must have been a merged header across the last two columns in the .xlsx. We know this because the name of the two merged columns (“merged_header”) was assigned by R to the first column, and the second column was assigned a placeholder name “…28” (as it was then empty and is the 28th column).

names(linelist_raw)
##  [1] "case_id"         "generation"      "infection date"  "date onset"      "hosp date"       "date_of_outcome" "outcome"         "gender"         
##  [9] "hospital"        "lon"             "lat"             "infector"        "source"          "age"             "age_unit"        "row_num"        
## [17] "wt_kg"           "ht_cm"           "ct_blood"        "fever"           "chills"          "cough"           "aches"           "vomit"          
## [25] "temp"            "time_admission"  "merged_header"   "...28"

NOTE: To reference a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`. Note that on your keyboard, the back-tick (`) is different from the single quotation mark (').
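The sketch below shows back-ticks in use. The data frame demo is invented for illustration; it is built with check.names = FALSE so that base R keeps the spaces in the column names.

```r
# hypothetical data frame whose column names contain spaces
# check.names = FALSE stops data.frame() from altering the names
demo <- data.frame(`infection date` = c("2014-05-08", "2014-05-13"),
                   `hosp date`      = c("2014-05-10", "2014-05-15"),
                   check.names = FALSE)

names(demo)                # the names still contain spaces

demo$`infection date`      # back-ticks are needed with the $ operator
demo[["hosp date"]]        # with [[ ]] the name is given in ordinary quotes
```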

Automatic cleaning

The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Accented characters are transliterated to ASCII (e.g. German o with umlaut becomes “o”, Spanish “enye” becomes “n”)
  • Capitalization preference for the new column names can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
  • You can specify specific name replacements by providing a vector to the replace = argument (e.g. replace = c(onset = "date_of_onset"))
  • Here is an online vignette
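To illustrate the points above, here is a small sketch of clean_names() with the default case and with case = "small_camel". The data frame demo and its messy column names are made up for this example.

```r
library(janitor)

# hypothetical data frame with messy column names
demo <- data.frame(`Infection Date` = "2014-05-08",
                   `hosp date`      = "2014-05-10",
                   `Weight (kg)`    = 60,
                   check.names = FALSE)

# default: "snake" case (lowercase words joined by underscores)
snake <- clean_names(demo)
names(snake)

# alternative capitalization set with the case = argument
camel <- clean_names(demo, case = "small_camel")
names(camel)
```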

Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# pipe the raw dataset through the function clean_names(), assign result as "linelist"  
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new column names
names(linelist)
##  [1] "case_id"         "generation"      "infection_date"  "date_onset"      "hosp_date"       "date_of_outcome" "outcome"         "gender"         
##  [9] "hospital"        "lon"             "lat"             "infector"        "source"          "age"             "age_unit"        "row_num"        
## [17] "wt_kg"           "ht_cm"           "ct_blood"        "fever"           "chills"          "cough"           "aches"           "vomit"          
## [25] "temp"            "time_admission"  "merged_header"   "x28"

NOTE: The last column name “…28” was changed to “x28”.

Manual name cleaning

Re-naming columns manually is often necessary, even after the standardization step above. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style NEW = OLD - the new column name is given before the old column name.

Below, a re-naming command is added to the cleaning pipeline. Spaces have been added strategically to align code for easier reading.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the column names have been changed:

##  [1] "case_id"              "generation"           "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"         "outcome"             
##  [8] "gender"               "hospital"             "lon"                  "lat"                  "infector"             "source"               "age"                 
## [15] "age_unit"             "row_num"              "wt_kg"                "ht_cm"                "ct_blood"             "fever"                "chills"              
## [22] "cough"                "aches"                "vomit"                "temp"                 "time_admission"       "merged_header"        "x28"

Rename by column position

You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn  = 1,
       newNameForSecondColumn = 2)
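Here is the same idea as a complete command that can be run on its own. The data frame demo and the new names are hypothetical.

```r
library(dplyr)

# hypothetical mini dataset with unhelpful column names
demo <- data.frame(id  = c("a1", "a2"),
                   yrs = c(34, 7),
                   sex = c("f", "m"))

renamed <- demo %>% 
  rename(case_id = 1,    # new name for the first column
         age     = 2)    # new name for the second column

names(renamed)           # the third column keeps its original name
```

Renaming by position is fragile if the column order of the raw data changes, so check the result with names() afterwards.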

Rename via select() and summarise()

As a shortcut, you can also rename columns within the dplyr select() and summarise() functions. select() is used to keep only certain columns (and is covered later in this page). summarise() is covered in the Grouping data and Descriptive tables pages. These functions also use the format new_name = old_name. Here is an example:

linelist_raw %>% 
  select(# NEW name             # OLD name
         date_infection       = `infection date`,    # rename and KEEP ONLY these columns
         date_hospitalisation = `hosp date`)

Other challenges

Empty Excel column names

R cannot have dataset columns that do not have column names (headers). So, if you import an Excel dataset with data but no column headers, R will fill-in the headers with names like “…1” or “…2”. The number represents the column number (e.g. if the 4th column in the dataset has no header, then R will name it “…4”).

You can clean these names manually by referencing their position number (see example above), or their assigned name (linelist_raw$...1).
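A brief sketch of both approaches follows. The data frame demo imitates an import in which the first column had no header; it is constructed by hand here, and the new name case_id is our assumption about what the column holds.

```r
library(dplyr)

# hypothetical import result: the first column was given the name "...1"
demo <- setNames(data.frame(c("a1", "a2"), c(34, 7)),
                 c("...1", "age"))

# option 1: rename by position
by_position <- demo %>% 
  rename(case_id = 1)

# option 2: rename by the assigned name, given here as a quoted string
by_name <- demo %>% 
  rename(case_id = "...1")

names(by_position)
```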

Merged Excel column names and cells

Merged cells in an Excel file are a common occurrence when receiving data. As explained in Transition to R, merged cells can be nice for human reading of data, but are not “tidy data” and cause many problems for machine reading of data. R cannot accommodate merged cells.

Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users about the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

One solution to deal with merged cells is to import the data with the function readWorkbook() from the package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook("linelist_raw.xlsx", fillMergedCells = TRUE)

DANGER: If column names are merged with readWorkbook(), you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning.
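As a sketch of how duplicate names might be detected and repaired with base R, consider the made-up data frame below, in which a merged header “symptoms” has been filled across two columns. The replacement names are only examples.

```r
# hypothetical result of a merged header filled across two columns
demo <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(demo) <- c("case_id", "symptoms", "symptoms")   # duplicate names

# check whether any column name is repeated
anyDuplicated(names(demo)) > 0

# option 1: re-name one column by its position
fixed <- demo
names(fixed)[3] <- "symptoms_2"

# option 2: let make.unique() add a suffix to each repeated name
auto <- demo
names(auto) <- make.unique(names(auto), sep = "_")

names(fixed)
names(auto)
```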

8.5 Select or re-order columns

Use select() from dplyr to select the columns you want to retain, and to specify their order in the data frame.

CAUTION: In the examples below, the linelist data frame is modified with select() and displayed, but not saved. This is for demonstration purposes. The modified column names are printed by piping the data frame to names().

Here are ALL the column names in the linelist at this point in the cleaning pipe chain:

names(linelist)
##  [1] "case_id"              "generation"           "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"         "outcome"             
##  [8] "gender"               "hospital"             "lon"                  "lat"                  "infector"             "source"               "age"                 
## [15] "age_unit"             "row_num"              "wt_kg"                "ht_cm"                "ct_blood"             "fever"                "chills"              
## [22] "cough"                "aches"                "vomit"                "temp"                 "time_admission"       "merged_header"        "x28"

Keep columns

Select only the columns you want to remain

Put their names in the select() command, with no quotation marks. They will appear in the data frame in the order you provide. Note that if you include a column that does not exist, R will return an error (see use of any_of() below if you want no error in this situation).

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names()  # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation" "fever"

“tidyselect” helper functions

These helper functions exist to make it easy to specify columns to keep, discard, or transform. They are from the package tidyselect, which is included in tidyverse and underlies how columns are selected in dplyr functions.

For example, if you want to re-order the columns, everything() is a useful function to signify “all other columns not yet mentioned”. The command below moves columns date_onset and date_hospitalisation to the beginning (left) of the dataset, but keeps all the other columns afterward. Note that everything() is written with empty parentheses:

# move date_onset and date_hospitalisation to beginning
linelist %>% 
  select(date_onset, date_hospitalisation, everything()) %>% 
  names()
##  [1] "date_onset"           "date_hospitalisation" "case_id"              "generation"           "date_infection"       "date_outcome"         "outcome"             
##  [8] "gender"               "hospital"             "lon"                  "lat"                  "infector"             "source"               "age"                 
## [15] "age_unit"             "row_num"              "wt_kg"                "ht_cm"                "ct_blood"             "fever"                "chills"              
## [22] "cough"                "aches"                "vomit"                "temp"                 "time_admission"       "merged_header"        "x28"

Here are other “tidyselect” helper functions that also work within dplyr functions like select(), across(), and summarise():

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • contains() - columns containing a character string
    • example: select(contains("time"))
  • starts_with() - matches to a specified prefix
    • example: select(starts_with("date_"))
  • ends_with() - matches to a specified suffix
    • example: select(ends_with("_post"))
  • matches() - to apply a regular expression (regex)
    • example: select(matches("[pt]al"))
  • num_range() - a numerical range like x01, x02, x03
  • any_of() - matches IF column exists but returns no error if it is not found
    • example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))

In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
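The operators can be sketched on a small invented data frame as follows. The object demo and its columns are hypothetical, loosely modelled on the linelist.

```r
library(dplyr)

# hypothetical mini dataset
demo <- data.frame(case_id      = c("a1", "a2"),
                   date_onset   = as.Date(c("2014-05-08", "2014-05-13")),
                   date_outcome = as.Date(c("2014-05-20", "2014-05-25")),
                   age          = c(34, 7),
                   wt_kg        = c(60, 22),
                   ht_cm        = c(170, 120))

# c() lists several columns
picked_c <- demo %>% select(c(case_id, age)) %>% names()

# : selects consecutive columns
picked_range <- demo %>% select(age:ht_cm) %>% names()

# ! selects the opposite (here, columns that are NOT numeric)
picked_not <- demo %>% select(!where(is.numeric)) %>% names()

# & requires both conditions
picked_and <- demo %>% select(starts_with("date") & ends_with("onset")) %>% names()

# | accepts either condition
picked_or <- demo %>% select(starts_with("date") | ends_with("_kg")) %>% names()
```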

Use where() to specify logical criteria for columns. If providing a function inside where(), do not include the function’s empty parentheses. The command below selects columns that are class Numeric.

# select columns that are class Numeric
linelist %>% 
  select(where(is.numeric)) %>% 
  names()
## [1] "generation" "lon"        "lat"        "row_num"    "wt_kg"      "ht_cm"      "ct_blood"   "temp"

Use contains() to select only columns in which the column name contains a specified character string. ends_with() and starts_with() provide more nuance.

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation" "date_outcome"

The function matches() works similarly to contains() but can be provided a regular expression (see page on Characters and strings), such as multiple strings separated by OR bars within the parentheses:

# search for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"             "fever"

CAUTION: If a column name that you specifically provide does not exist in the data, it can return an error and stop your code. Consider using any_of() to cite columns that may or may not exist, especially useful in negative (remove) selections.

Only one of these columns exists, but no error is produced and the code continues without stopping your cleaning chain.

linelist %>% 
  select(any_of(c("date_onset", "village_origin", "village_detection", "village_residence", "village_travel"))) %>% 
  names()
## [1] "date_onset"
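For a negative (remove) selection, any_of() can be combined with the minus symbol, as in this sketch. The data frame demo is invented; only one of the three columns named for removal actually exists in it.

```r
library(dplyr)

# hypothetical mini dataset: contains row_num but not merged_header or x28
demo <- data.frame(case_id    = c("a1", "a2"),
                   date_onset = c("2014-05-08", "2014-05-13"),
                   row_num    = 1:2)

# remove these columns if they exist; no error for those that do not
cleaned <- demo %>% 
  select(-any_of(c("row_num", "merged_header", "x28")))

names(cleaned)
```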

Remove columns

Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove date_onset and all columns from fever to vomit
  names()
##  [1] "case_id"              "generation"           "date_infection"       "date_hospitalisation" "date_outcome"         "outcome"              "gender"              
##  [8] "hospital"             "lon"                  "lat"                  "infector"             "source"               "age"                  "age_unit"            
## [15] "row_num"              "wt_kg"                "ht_cm"                "ct_blood"             "temp"                 "time_admission"       "merged_header"       
## [22] "x28"

You can also remove a column using base R syntax, by defining it as NULL. For example:

linelist$date_onset <- NULL   # deletes column with base R syntax 

Standalone

select() can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)
## [1] "case_id"  "age"      "age_unit"

Add to the pipe chain

In the linelist_raw, there are a few columns we do not need: row_num, merged_header, and x28. We remove them with a select() command in the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    #####################################################

    # remove column
    select(-c(row_num, merged_header, x28))

8.6 Deduplication

See the handbook page on De-duplication for extensive options on how to de-duplicate data. Only a very simple row de-duplication example is presented here.

The package dplyr offers the distinct() function. This function examines every row and reduces the data frame to only the unique rows. That is, it removes rows that are 100% duplicates.

When evaluating duplicate rows, it takes into account a range of columns - by default it considers all columns. As shown in the de-duplication page, you can adjust this column range so that the uniqueness of rows is only evaluated in regards to certain columns.

In this simple example, we just add the empty command distinct() to the pipe chain. This ensures there are no rows that are 100% duplicates of other rows (evaluated across all columns).

We begin with nrow(linelist) rows in linelist.

linelist <- linelist %>% 
  distinct()

After de-duplication there are nrow(linelist) rows. Any removed rows would have been 100% duplicates of other rows.
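To contrast the default behavior with uniqueness evaluated on certain columns only, here is a small sketch on invented data. The data frame demo is hypothetical; .keep_all = TRUE is the distinct() argument that retains the other columns.

```r
library(dplyr)

# hypothetical mini dataset: rows 1 and 2 are complete duplicates,
# rows 3 and 4 share a case_id but differ in age
demo <- data.frame(case_id = c("a1", "a1", "a2", "a2"),
                   age     = c(34, 34, 7, 8))

# default: a row is dropped only if ALL columns match another row
all_cols <- demo %>% 
  distinct()

# uniqueness judged on case_id only; the first row per case_id is kept
by_id <- demo %>% 
  distinct(case_id, .keep_all = TRUE)

nrow(all_cols)
nrow(by_id)
```

Note that the second approach silently discards the row with age 8, so it should only be used once you have decided which duplicate to keep.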

Below, the distinct() command is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    #####################################################
    
    # de-duplicate
    distinct()

8.7 Column creation and transformation

We recommend using the dplyr function mutate() to add a new column, or to modify an existing one.

Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation)

In Stata, this is similar to the command generate, but R’s mutate() can also be used to modify an existing column.

New columns

The most basic mutate() command to create a new column might look like this. It creates a new column new_col where the value in every row is 10.

linelist <- linelist %>% 
  mutate(new_col = 10)

You can also reference values in other columns, to perform calculations. Below, a new column bmi is created to hold the Body Mass Index (BMI) for each case - as calculated using the formula BMI = kg/m^2, using column ht_cm and column wt_kg.

linelist <- linelist %>% 
  mutate(bmi = wt_kg / (ht_cm/100)^2)

If creating multiple new columns, separate each with a comma and new line. Below are examples of new columns, including ones that consist of values from other columns combined using str_glue() from the stringr package (see page on Characters and strings).

new_col_demo <- linelist %>%                       
  mutate(
    new_var_dup    = case_id,             # new column = duplicate/copy another existing column
    new_var_static = 7,                   # new column = all values the same
    new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
    new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
    ) %>% 
  select(case_id, hospital, date_hospitalisation, contains("new"))        # show only new columns, for demonstration purposes

Review the new columns. For demonstration purposes, only the new columns and the columns used to create them are shown:

TIP: A variation on mutate() is the function transmute(). This function adds a new column just like mutate(), but also drops/removes all other columns that you do not mention within its parentheses.
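The difference can be sketched on a small made-up data frame. The object demo is hypothetical, and the BMI formula is the one used earlier in this section.

```r
library(dplyr)

# hypothetical mini dataset
demo <- data.frame(case_id = c("a1", "a2"),
                   wt_kg   = c(60, 22),
                   ht_cm   = c(170, 120))

# mutate() adds bmi and keeps every existing column
with_mutate <- demo %>% 
  mutate(bmi = wt_kg / (ht_cm / 100)^2)

# transmute() keeps only the columns named inside its parentheses
with_transmute <- demo %>% 
  transmute(case_id,
            bmi = wt_kg / (ht_cm / 100)^2)

names(with_mutate)
names(with_transmute)
```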

Convert column class

Columns containing values that are dates, numbers, or logical values (TRUE/FALSE) will only behave as expected if they are correctly classified. There is a difference between “2” of class character and 2 of class numeric!

There are ways to set column class during the import commands, but this is often cumbersome. See the R Basics section on object classes to learn more about converting the class of objects and columns.

First, let’s run some checks on important columns to see if they are the correct class. We also saw this in the beginning when we ran skim().

Currently, the class of the age column is character. To perform quantitative analyses, we need these numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the date_onset column is also character! To perform analyses, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

To resolve this, use the ability of mutate() to re-define a column with a transformation. We define the column as itself, but converted to a different class. Here is a basic example, converting or ensuring that the column age is class Numeric:

linelist <- linelist %>% 
  mutate(age = as.numeric(age))

In a similar way, you can use as.character() and as.logical(). To convert to class Factor, you can use factor() from base R or as_factor() from forcats. Read more about this in the Factors page.

You must be careful when converting to class Date. Several methods are explained on the page Working with dates. Typically, the raw date values must all be in the same format for conversion to work correctly (e.g. “MM/DD/YYYY” or “DD MM YYYY”). After converting to class Date, check your data to confirm that each value was converted correctly.
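The sketch below shows why checking is worthwhile, using base R only. The vectors age_raw and date_raw are invented, and the format string "%m/%d/%Y" is an assumption that the raw dates are written as MM/DD/YYYY.

```r
# hypothetical raw values stored as class character
age_raw  <- c("34", "7", "unknown")
date_raw <- c("05/08/2014", "05/13/2014", NA)

# "unknown" cannot be read as a number, so it becomes NA
# (R normally warns about this; the warning is silenced here only to keep the demo quiet)
age_num <- suppressWarnings(as.numeric(age_raw))

# format = describes how the RAW values are written
date_ok <- as.Date(date_raw, format = "%m/%d/%Y")

class(age_num)
class(date_ok)

# count the values that are missing after conversion
sum(is.na(age_num))
sum(is.na(date_ok))
```

Comparing the number of missing values before and after conversion is a quick way to spot values that failed to convert.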

Grouped data

If your data frame is already grouped (see page on Grouping data), mutate() may behave differently than if the data frame is not grouped. Any summarizing functions, like mean(), median(), max(), etc. will calculate by group, not by all the rows.

# age normalized to mean of ALL rows
linelist %>% 
  mutate(age_norm = age / mean(age, na.rm = TRUE))

# age normalized to mean of hospital group
linelist %>% 
  group_by(hospital) %>% 
  mutate(age_norm = age / mean(age, na.rm = TRUE))

Read more about using mutate() on grouped data frames in this tidyverse mutate documentation.

Transform multiple columns

To write concise code, you often want to apply the same transformation to multiple columns at once. This can be done with the across() function from the package dplyr (which is part of the tidyverse). across() is designed for use inside dplyr “verb” functions that compute on columns, most commonly mutate() and summarise(). Within select(), the “tidyselect” helpers are used directly, without across(). See how across() is applied to summarise() in the page on Descriptive tables.

Specify the columns to the argument .cols = and the function(s) to apply to .fns =. Any additional arguments to provide to the .fns function can be included after a comma, still within across().
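One hedged alternative for supplying extra arguments is to write them inside a purrr-style lambda (beginning with ~), so that they are stated explicitly next to the function. To our understanding, recent dplyr versions (1.1.0 onward) prefer this style over adding the arguments after a comma. The data frame demo below is invented for illustration.

```r
library(dplyr)

# hypothetical mini dataset
demo <- data.frame(temp  = c(38.64, 39.27),
                   wt_kg = c(60.48, 22.13))

# round both columns to 1 decimal place
# .x stands for "the current column"; digits = 1 is the extra argument
rounded <- demo %>% 
  mutate(across(.cols = c(temp, wt_kg),
                .fns  = ~ round(.x, digits = 1)))

rounded
```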

across() column selection

Specify the columns to the argument .cols =. You can name them individually, or use “tidyselect” helper functions. Specify the function to .fns =. Note that using the function mode demonstrated below, the function is written without its parentheses ( ).

Here the transformation as.character() is applied to specific columns named within across().

linelist <- linelist %>% 
  mutate(across(.cols = c(temp, ht_cm, wt_kg), .fns = as.character))

The “tidyselect” helper functions are available to assist you in specifying columns. They are detailed above in the section on Selecting and re-ordering columns, and they include: everything(), last_col(), where(), starts_with(), ends_with(), contains(), matches(), num_range() and any_of().

Here is an example of how one would change all columns to character class:

#to change all columns to character class
linelist <- linelist %>% 
  mutate(across(.cols = everything(), .fns = as.character))

Convert to character all columns where the name contains the string “date” (note the placement of commas and parentheses):

# to change columns with "date" in their name to character class
linelist <- linelist %>% 
  mutate(across(.cols = contains("date"), .fns = as.character))

Below, an example of mutating the columns that are currently class POSIXct (a raw datetime class that shows timestamps) - in other words, where the function is.POSIXct() evaluates to TRUE. Then we want to apply the function as.Date() to these columns to convert them to a normal class Date.

linelist <- linelist %>% 
  mutate(across(.cols = where(is.POSIXct), .fns = as.Date))

  • Note that within across() we also use the function where(), because is.POSIXct() evaluates to either TRUE or FALSE for each column.
  • Note that is.POSIXct() is from the package lubridate. Other similar “is” functions like is.character(), is.numeric(), and is.logical() are from base R.

across() functions

You can read the documentation with ?across for details on how to provide functions to across(). A few summary points: there are several ways to specify the function(s) to perform on a column and you can even define your own functions:

  • You can provide the function name alone (e.g. mean or as.character)
  • You can provide the function in purrr-style (e.g. ~ mean(.x, na.rm = TRUE)) (see this page)
  • You can specify multiple functions by providing a list (e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))).
    • If you provide multiple functions, multiple transformed columns will be returned per input column, with unique names in the format col_fn. You can adjust how the new columns are named with the .names = argument using glue syntax (see page on Characters and strings) where {.col} and {.fn} are shorthand for the input column and function.
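As a minimal sketch of the list and .names = options described above, the code below summarises a small made-up data frame. The object toy and its values are hypothetical (not part of the handbook data), and the example assumes the dplyr package is installed.

```r
library(dplyr)

# hypothetical toy data, for illustration only
toy <- tibble(
  ht_cm = c(150, 160, NA),
  wt_kg = c(50, NA, 70))

toy_summary <- toy %>% 
  summarise(across(
    .cols  = c(ht_cm, wt_kg),                        # columns to summarise
    .fns   = list(mean   = ~ mean(.x, na.rm = TRUE), # purrr-style function
                  n_miss = ~ sum(is.na(.x))),        # count of missing values
    .names = "{.fn}_{.col}"))                        # e.g. mean_ht_cm, n_miss_ht_cm

names(toy_summary)
```

Here the glue pattern puts the function name first; leaving out .names = would give the default format col_fn (e.g. ht_cm_mean).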

Here are a few online resources on using across(): creator Hadley Wickham’s thoughts/rationale

coalesce()

This dplyr function finds the first non-missing value at each position. It “fills-in” missing values with the first available value in an order you specify.

Here is an example outside the context of a data frame: Let us say you have two vectors, one containing the patient’s village of detection and another containing the patient’s village of residence. You can use coalesce to pick the first non-missing value for each index:

village_detection <- c("a", "b", NA,  NA)
village_residence <- c("a", "c", "a", "d")

village <- coalesce(village_detection, village_residence)
village    # print
## [1] "a" "b" "a" "d"

This works the same if you provide data frame columns: for each row, the function will assign the new column value with the first non-missing value in the columns you provided (in order provided).

linelist <- linelist %>% 
  mutate(village = coalesce(village_detection, village_residence))

This is an example of a “row-wise” operation. For more complicated row-wise calculations, see the section below on Row-wise calculations.

Cumulative math

If you want a column to reflect the cumulative sum/mean/min/max etc as assessed down the rows of a dataframe to that point, use the following functions:

cumsum() returns the cumulative sum, as shown below:

sum(c(2,4,15,10))     # returns only one number
## [1] 31
cumsum(c(2,4,15,10))  # returns the cumulative sum at each step
## [1]  2  6 21 31

This can be used in a dataframe when making a new column. For example, to calculate the cumulative number of cases per day in an outbreak, consider code like this:

cumulative_case_counts <- linelist %>%  # begin with case linelist
  count(date_onset) %>%                 # count of rows per day, as column 'n'   
  mutate(cumulative_cases = cumsum(n))  # new column, of the cumulative sum at each row

Below are the first 10 rows:

head(cumulative_case_counts, 10)
##    date_onset n cumulative_cases
## 1  2012-04-15 1                1
## 2  2012-05-05 1                2
## 3  2012-05-08 1                3
## 4  2012-05-31 1                4
## 5  2012-06-02 1                5
## 6  2012-06-07 1                6
## 7  2012-06-14 1                7
## 8  2012-06-21 1                8
## 9  2012-06-24 1                9
## 10 2012-06-25 1               10

See the page on Epidemic curves for how to plot cumulative incidence with the epicurve.

See also:
cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
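
As a quick sketch of the other functions listed above, here they are applied to the same small vector used earlier. cummin() and cummax() are from base R, while cummean(), cumany() and cumall() are assumed to come from dplyr, so the package must be loaded.

```r
library(dplyr)

daily_counts <- c(2, 4, 15, 10)   # same example values as above

cummin(daily_counts)              # lowest value seen so far:  2 2 2 2
cummax(daily_counts)              # highest value seen so far: 2 4 15 15
cummean(daily_counts)             # running mean:              2 3 7 7.75

# logical versions: has any / have all values so far met the condition?
ever_above_10   <- cumany(daily_counts > 10)   # FALSE FALSE TRUE TRUE
always_below_10 <- cumall(daily_counts < 10)   # TRUE TRUE FALSE FALSE
```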

Using base R

To define a new column (or re-define a column) using base R, write the name of data frame, connected with $, to the new column (or the column to be modified). Use the assignment operator <- to define the new value(s). Remember that when using base R you must specify the data frame name before the column name every time (e.g. dataframe$column). Here is an example of creating the bmi column using base R:

linelist$bmi <- linelist$wt_kg / (linelist$ht_cm / 100)^2

Add to pipe chain

Below, a new column is added to the pipe chain and some classes are converted.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    # add new column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>% 
  
    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) 

8.8 Re-code values

Here are a few scenarios where you need to re-code (change) values:

  • to edit one specific value (e.g. one date with an incorrect year or format)
  • to reconcile values not spelled the same
  • to create a new column of categorical values
  • to create a new column of numeric categories (e.g. age categories)

Specific values

To change values manually you can use the recode() function within the mutate() function.

Imagine there is a nonsensical date in the data (e.g. “2014-14-15”): you could fix the date manually in the raw source data, or, you could write the change into the cleaning pipeline via mutate() and recode(). The latter is more transparent and reproducible to anyone else seeking to understand or repeat your analysis.

# fix incorrect values                   # old value       # new value
linelist <- linelist %>% 
  mutate(date_onset = recode(date_onset, "2014-14-15" = "2014-04-15"))

The mutate() line above can be read as: “mutate the column date_onset to equal the column date_onset re-coded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this.

Here is another example re-coding multiple values within one column.

In linelist the values in the column “hospital” must be cleaned. There are several different spellings and many missing values.

table(linelist$hospital, useNA = "always")  # print table of all unique values, including missing  
## 
##                      Central Hopital                     Central Hospital                           Hospital A                           Hospital B 
##                                   11                                  457                                  290                                  289 
##                     Military Hopital                    Military Hospital                     Mitylira Hopital                    Mitylira Hospital 
##                                   32                                  798                                    1                                   79 
##                                Other                         Port Hopital                        Port Hospital St. Mark's Maternity Hospital (SMMH) 
##                                  907                                   48                                 1756                                  417 
##   St. Marks Maternity Hopital (SMMH)                                 <NA> 
##                                   11                                 1512

The recode() command below re-defines the column “hospital” as the current column “hospital”, but with the specified recode changes. Don’t forget commas after each!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                     # for reference: OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      ))

Now we see the spellings in the hospital column have been corrected and consolidated:

table(linelist$hospital, useNA = "always")
## 
##                     Central Hospital                           Hospital A                           Hospital B                    Military Hospital 
##                                  468                                  290                                  289                                  910 
##                                Other                        Port Hospital St. Mark's Maternity Hospital (SMMH)                                 <NA> 
##                                  907                                 1804                                  428                                 1512

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space in between ("").
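
As a small sketch of the blank-value tip, the code below converts empty strings to NA using dplyr's na_if(). The vector hospital_raw is hypothetical and only for illustration.

```r
library(dplyr)

# hypothetical values: one blank string and one true NA
hospital_raw <- c("Port Hospital", "", NA, "Other")

# "" is a real (empty) character value, so is.na() does not flag it
is.na(hospital_raw)

# convert the blank value to NA
hospital_clean <- na_if(hospital_raw, "")
is.na(hospital_clean)
```

The same call can be placed inside mutate() to clean a data frame column.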

By logic

Below we demonstrate how to re-code values in a column using logic and conditions:

  • Using replace(), ifelse() and if_else() for simple logic
  • Using case_when() for more complex logic

Simple logic

replace()

To re-code with simple logical criteria, you can use replace() within mutate(). replace() is a function from base R. Use a logic condition to specify the rows to change. The general syntax is:

mutate(col_to_change = replace(col_to_change, criteria for rows, new value)).

One common use of replace() is changing just one value in one row, using a unique row identifier. Below, the gender is changed to “Female” in the row where the column case_id is “2195”.

# Example: change gender of one specific observation to "Female" 
linelist <- linelist %>% 
  mutate(gender = replace(gender, case_id == "2195", "Female"))

The equivalent command using base R syntax and indexing brackets [ ] is below. It reads as “Change the value of the dataframe linelist’s column gender (for the rows where linelist’s column case_id has the value ‘2195’) to ‘Female’”.

linelist$gender[linelist$case_id == "2195"] <- "Female"

ifelse() and if_else()

Another tool for simple logic is ifelse() and its partner if_else(). However, in most cases for re-coding it is clearer to use case_when() (detailed below). These “if else” commands are simplified versions of an if and else programming statement. The general syntax is:
ifelse(condition, value to return if condition evaluates to TRUE, value to return if condition evaluates to FALSE)

Below, the column source_known is defined. Its value in a given row is set to “known” if the row’s value in column source is not missing. If the value in source is missing, then the value in source_known is set to “unknown”.

linelist <- linelist %>% 
  mutate(source_known = ifelse(!is.na(source), "known", "unknown"))

if_else() is a stricter version from dplyr that preserves classes such as dates. Note that if the ‘true’ value is a date, the ‘false’ value must also be a date, hence using the date-typed missing value as.Date(NA) instead of just NA.

# Create a date of death column, which is NA if patient has not died.
linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, as.Date(NA)))

Avoid stringing together many ifelse commands… use case_when() instead! case_when() is much easier to read and you’ll make fewer errors.

Outside of the context of a data frame, if you want to have an object used in your code switch its value, consider using switch() from base R.
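
As a minimal sketch of switch(), the code below picks a plot title based on the value of one object. The object names and the titles are hypothetical, chosen only for illustration.

```r
# hypothetical setting that controls the language of a report
report_language <- "fr"

plot_title <- switch(report_language,
  "en" = "Cases by week of onset",
  "fr" = "Cas par semaine d'apparition",
  "Cases by week of onset")   # unnamed last value is the default

plot_title
```

Unlike ifelse(), switch() expects a single value (not a column), which is why it suits settings and parameters rather than row-by-row re-coding.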

Complex logic

Use dplyr’s case_when() if you are re-coding into many new groups, or if you need to use complex logic statements to re-code values. This function evaluates every row in the data frame, assesses whether the row meets the specified criteria, and assigns the correct new value.

case_when() commands consist of statements that have a Right-Hand Side (RHS) and a Left-Hand Side (LHS) separated by a “tilde” ~. The logic criteria are in the left side and the pursuant values are in the right side of each statement. Statements are separated by commas.

For example, here we utilize the columns age and age_unit to create a column age_years:

linelist <- linelist %>% 
  mutate(age_years = case_when(
            age_unit == "years"  ~ age,       # if age is given in years
            age_unit == "months" ~ age/12,    # if age is given in months
            is.na(age_unit)      ~ age,       # if age unit is missing, assume years
            TRUE                 ~ NA_real_)) # any other circumstance, assign missing

As each row in the data is evaluated, the criteria are applied/evaluated in the order the case_when() statements are written - from top-to-bottom. If the top criteria evaluates to TRUE for a given row, the RHS value is assigned, and the remaining criteria are not even tested for that row. Thus, it is best to write the most specific criteria first, and the most general last.

Along those lines, in your final statement, place TRUE on the left-side, which will capture any row that did not meet any of the previous criteria. The right-side of this statement could be assigned a value like “check me!” or missing.

DANGER: Values on the right-side must all be the same class - either numeric, character, date, logical, etc. To assign missing (NA), you may need to use special variations of NA such as NA_character_, NA_real_ (for numeric or POSIX), and as.Date(NA). Read more in Working with dates.
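
To illustrate the top-to-bottom evaluation and the final TRUE statement described above, here is a minimal sketch on a made-up data frame. The object toy, the cut-offs, and the group labels are hypothetical.

```r
library(dplyr)

# hypothetical ages, including one missing value
toy <- tibble(age_years = c(0.5, 3, 17, 40, NA))

toy <- toy %>% 
  mutate(age_group = case_when(
    age_years < 1   ~ "infant",       # most specific criteria first
    age_years < 5   ~ "under 5",      # only reached if the row is not < 1
    age_years < 18  ~ "child",
    age_years >= 18 ~ "adult",
    TRUE            ~ NA_character_)) # anything else (here: missing age)

toy$age_group
```

If the order of the first two statements were swapped, the infant would be labelled “under 5”, because the first criteria that evaluates to TRUE wins. Note also that every right-side value is character, including the typed missing value NA_character_.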

Missing values

Below are special functions for handling missing values in the context of data cleaning.

See the page on Missing data for more detailed tips on identifying and handling missing values, for example the is.na() function, which logically tests for missingness.

replace_na()

To change missing values (NA) to a specific value, such as “Missing”, use the tidyr function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

fct_explicit_na()

This is a function from the forcats package. The forcats package handles columns of class Factor. Factors are R’s way to handle ordered values such as c("First", "Second", "Third") or to set the order that values (e.g. hospitals) appear in tables and plots. See the page on Factors.

If your data are class Factor and you try to convert NA to “Missing” by using replace_na(), you will get this error: invalid factor level, NA generated. You have tried to add “Missing” as a value, when it was not defined as a possible level of the factor, and it was rejected.

The easiest way to solve this is to use the forcats function fct_explicit_na() which converts a column to class factor, and converts NA values to the character “(Missing)”.

linelist %>% 
  mutate(hospital = fct_explicit_na(hospital))

A slower alternative would be to add the factor level using fct_expand() and then convert the missing values.
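
A minimal sketch of that two-step alternative is below, assuming the forcats package is installed. The vector hospital_fct is hypothetical, and base R assignment is used for the second step to keep the example simple.

```r
library(forcats)

# hypothetical factor with one missing value
hospital_fct <- factor(c("Port Hospital", NA, "Other"))

# step 1: add "Missing" as an allowed level
hospital_fct <- fct_expand(hospital_fct, "Missing")

# step 2: now NA can be converted without an 'invalid factor level' warning
hospital_fct[is.na(hospital_fct)] <- "Missing"

levels(hospital_fct)
```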

na_if()

To convert a specific value to NA, use dplyr’s na_if(). The command below performs the opposite operation of replace_na(). In the example below, any values of “Missing” in the column hospital are converted to NA.

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

Note: na_if() cannot be used for logic criteria (e.g. “all values > 99”) - use replace() or case_when() for this:

# Convert temperatures above 40 to NA 
linelist <- linelist %>% 
  mutate(temp = replace(temp, temp > 40, NA))

# Convert onset dates earlier than 1 Jan 2000 to missing
linelist <- linelist %>% 
  mutate(date_onset = replace(date_onset, date_onset < as.Date("2000-01-01"), NA))

Cleaning dictionary

Use the R package linelist and its function clean_variable_spelling() to clean a data frame with a cleaning dictionary. linelist is a package developed by RECON - the R Epidemics Consortium.

  1. Create a cleaning dictionary with 3 columns:
    • A “from” column (the incorrect value)
    • A “to” column (the correct value)
    • A column specifying the column for the changes to be applied (or “.global” to apply to all columns)

Note: .global dictionary entries will be overridden by column-specific dictionary entries.

  2. Import the dictionary file into R. This example can be downloaded via instructions on the Download handbook and data page.
cleaning_dict <- import("cleaning_dict.csv")
  3. Pass the raw linelist to clean_variable_spelling(), specifying to wordlists = the cleaning dictionary data frame. The spelling_vars = argument can be used to specify which column in the dictionary refers to the columns (3rd by default), or can be set to NULL to have the dictionary apply to all character and factor columns. Note this function can take a long time to run.
linelist <- linelist %>% 
  linelist::clean_variable_spelling(
    wordlists = cleaning_dict,
    spelling_vars = "col"         # dict column containing column names, defaults to 3rd column in dict
  )

After this step, values across the data frame have changed - particularly gender (lowercase to uppercase), and all the symptoms columns have been transformed from yes/no to 1/0.

Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. See this online reference for the linelist package for more details.
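
To make the three-column structure of a cleaning dictionary concrete, here is a minimal sketch built directly in R rather than imported. The entries are hypothetical (loosely based on the gender and yes/no changes mentioned above), and the manual lookup at the end is only an illustration of the idea in base R, not how the linelist package implements it.

```r
# hypothetical cleaning dictionary with "from", "to" and "col" columns
cleaning_dict <- data.frame(
  from = c("f", "m", "yes", "no"),
  to   = c("F", "M", "1", "0"),
  col  = c("gender", "gender", ".global", ".global"),
  stringsAsFactors = FALSE)

# illustration only: apply the gender rules to a made-up vector
gender_raw   <- c("f", "m", "m", NA)
gender_rules <- cleaning_dict[cleaning_dict$col == "gender", ]
idx          <- match(gender_raw, gender_rules$from)        # which rule matches each value
gender_clean <- ifelse(is.na(idx), gender_raw, gender_rules$to[idx])

gender_clean
```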

Add to pipe chain

Below, some new columns and column transformations are added to the pipe chain.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 
  
    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
   # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
   ###################################################

    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_))

8.9 Numeric categories

Here we describe some special approaches for creating categories from numerical columns. Common examples include age categories, groups of lab values, etc. Here we will discuss:

  • age_categories(), from the epikit package
  • cut(), from base R
  • case_when()
  • quantile breaks with quantile() and ntile()

Review distribution

For this example we will create an age_cat column using the age_years column.

# check the class of the linelist variable age_years
class(linelist$age_years)
## [1] "numeric"

First, examine the distribution of your data, to make appropriate cut-points. See the page on ggplot basics.

# examine the distribution
hist(linelist$age_years)

summary(linelist$age_years, na.rm=T)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   16.04   23.00   84.00     107

CAUTION: Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used in the decimals place (e.g. “4,5” to mean four and one half years).
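
One way to find the offending values is sketched below with base R, using a hypothetical vector age_raw. It relies on as.numeric() returning NA for any value it cannot parse.

```r
# hypothetical ages imported as character
age_raw <- c("12", "2 months", "4,5", NA, "30")

# attempt the conversion; un-parseable values become NA (warning suppressed)
age_num <- suppressWarnings(as.numeric(age_raw))

# values that were present in the raw data but failed to convert
problem_values <- age_raw[is.na(age_num) & !is.na(age_raw)]
problem_values
```

Reviewing problem_values before converting the column helps you decide whether to fix the entries at the source or re-code them in the cleaning pipeline.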

age_categories()

With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this function can be applied to non-age numeric variables too). As a bonus, the output column is automatically an ordered factor.

Here are the required inputs:

  • A numeric vector (column)
  • The breakers = argument - provide a numeric vector of break points for the new groups

First, the simplest example:

# Simple example
################
pacman::p_load(epikit)                    # load package

linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(             # create new column
      age_years,                            # numeric column to make groups from
      breakers = c(0, 5, 10, 15, 20,        # break points
                   30, 40, 50, 60, 70)))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-69   70+  <NA> 
##  1227  1223  1048   827  1216   597   251    78    27     7   107

The break values you specify are by default the lower bounds - that is, each break value is included in the “higher” group (the groups are inclusive on the lower/left side). As shown below, you can add 1 to each break value to achieve groups that are inclusive at the top/right.

# Include upper ends for the same categories
############################################
linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      breakers = c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-5  6-10 11-15 16-20 21-30 31-40 41-50 51-60 61-70   71+  <NA> 
##  1469  1195  1040   770  1149   547   231    70    24     6   107

You can adjust how the labels are displayed with separator =. The default is “-”

You can adjust how the top numbers are handled with the ceiling = argument. To set an upper cut-off, set ceiling = TRUE. In this use, the highest break value provided is a “ceiling” and a category “XX+” is not created. Any values above the highest break value (or above upper =, if defined) are categorized as NA. Below is an example with ceiling = TRUE, so that there is no category of XX+ and values above 70 (the highest break value) are assigned as NA.

# With ceiling set to TRUE
##########################
linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
      ceiling = TRUE)) # 70 is ceiling, all above become NA

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-4   5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-70  <NA> 
##  1227  1223  1048   827  1216   597   251    78    28   113

Alternatively, instead of breakers =, you can provide all of lower =, upper =, and by =:

  • lower = The lowest number you want considered - default is 0
  • upper = The highest number you want considered
  • by = The number of years between groups
linelist <- linelist %>% 
  mutate(
    age_cat = age_categories(
      age_years, 
      lower = 0,
      upper = 100,
      by = 10))

# show table
table(linelist$age_cat, useNA = "always")
## 
##   0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99  100+  <NA> 
##  2450  1875  1216   597   251    78    27     6     1     0     0   107

See the function’s Help page for more details (enter ?age_categories in the R console).

cut()

cut() is a base R alternative to age_categories(), but I think you will see why age_categories() was developed to simplify this process. Some notable differences from age_categories() are:

  • You do not need to install/load another package
  • You can specify whether groups are open/closed on the right/left
  • You must provide accurate labels yourself
  • If you want 0 included in the lowest group you must specify this

The basic syntax within cut() is to first provide the numeric column to be cut (age_years), and then the breaks argument, which is a numeric vector c() of break points. Using cut(), the resulting column is a factor whose levels follow the order of the break points (add ordered_result = TRUE if you need an ordered factor).

By default, the categorization occurs so that the right/upper side of each group is inclusive and the left/lower side is exclusive. This is the opposite behavior from the age_categories() function. The default labels use the notation “(A, B]”, which means A is not included but B is. Reverse this behavior by providing the right = FALSE argument.

Thus, by default, “0” values are excluded from the lowest group, and categorized as NA! “0” values could be infants coded as age 0 so be careful! To change this, add the argument include.lowest = TRUE so that any “0” values will be included in the lowest group. The automatically-generated label for the lowest category will then be “[A,B]”. Note that if you use include.lowest = TRUE together with right = FALSE, the extreme inclusion will apply to the highest break point value and category, not the lowest.

You can provide a vector of customized labels using the labels = argument. As these are manually written, be very careful to ensure they are accurate! Check your work using cross-tabulation, as described below.
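
The effect of right = and include.lowest = on boundary values is easiest to see on a tiny vector. The sketch below uses hypothetical ages that sit exactly on the break points.

```r
ages       <- c(0, 5, 10, 15)   # hypothetical values, all on a break point
age_breaks <- c(0, 5, 10, 15)

# default: upper side inclusive, so 0 falls outside every group
default_cut <- cut(ages, breaks = age_breaks)
as.character(default_cut)   # NA "(0,5]" "(5,10]" "(10,15]"

# include.lowest = TRUE: 0 joins the lowest group
lowest_cut <- cut(ages, breaks = age_breaks, include.lowest = TRUE)
as.character(lowest_cut)    # "[0,5]" "[0,5]" "(5,10]" "(10,15]"

# right = FALSE: lower side inclusive, so 15 falls outside every group
left_cut <- cut(ages, breaks = age_breaks, right = FALSE)
as.character(left_cut)      # "[0,5)" "[5,10)" "[10,15)" NA

# right = FALSE with include.lowest = TRUE: 15 joins the highest group
left_incl_cut <- cut(ages, breaks = age_breaks, right = FALSE, include.lowest = TRUE)
as.character(left_incl_cut) # "[0,5)" "[5,10)" "[10,15]" "[10,15]"
```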

An example of cut() applied to age_years to make the new variable age_cat is below:

# Create new variable, by cutting the numeric age variable
# lower break is excluded but upper break is included in each category
linelist <- linelist %>% 
  mutate(
    age_cat = cut(
      age_years,
      breaks = c(0, 5, 10, 15, 20,
                 30, 50, 70, 100),
      include.lowest = TRUE         # include 0 in lowest group
      ))

# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
## 
##    [0,5]   (5,10]  (10,15]  (15,20]  (20,30]  (30,50]  (50,70] (70,100]     <NA> 
##     1469     1195     1040      770     1149      778       94        6      107

Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 16-20).

# Cross tabulation of the numeric and category columns. 
table("Numeric Values" = linelist$age_years,   # names specified in table for clarity.
      "Categories"     = linelist$age_cat,
      useNA = "always")                        # don't forget to examine NA values
##                     Categories
## Numeric Values       [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100] <NA>
##   0                    136      0       0       0       0       0       0        0    0
##   0.0833333333333333     1      0       0       0       0       0       0        0    0
##   0.25                   2      0       0       0       0       0       0        0    0
##   0.333333333333333      6      0       0       0       0       0       0        0    0
##   0.416666666666667      1      0       0       0       0       0       0        0    0
##   0.5                    6      0       0       0       0       0       0        0    0
##   0.583333333333333      3      0       0       0       0       0       0        0    0
##   0.666666666666667      3      0       0       0       0       0       0        0    0
##   0.75                   3      0       0       0       0       0       0        0    0
##   0.833333333333333      1      0       0       0       0       0       0        0    0
##   0.916666666666667      1      0       0       0       0       0       0        0    0
##   1                    275      0       0       0       0       0       0        0    0
##   1.5                    2      0       0       0       0       0       0        0    0
##   2                    308      0       0       0       0       0       0        0    0
##   3                    246      0       0       0       0       0       0        0    0
##   4                    233      0       0       0       0       0       0        0    0
##   5                    242      0       0       0       0       0       0        0    0
##   6                      0    241       0       0       0       0       0        0    0
##   7                      0    256       0       0       0       0       0        0    0
##   8                      0    239       0       0       0       0       0        0    0
##   9                      0    245       0       0       0       0       0        0    0
##   10                     0    214       0       0       0       0       0        0    0
##   11                     0      0     220       0       0       0       0        0    0
##   12                     0      0     224       0       0       0       0        0    0
##   13                     0      0     191       0       0       0       0        0    0
##   14                     0      0     199       0       0       0       0        0    0
##   15                     0      0     206       0       0       0       0        0    0
##   16                     0      0       0     186       0       0       0        0    0
##   17                     0      0       0     164       0       0       0        0    0
##   18                     0      0       0     141       0       0       0        0    0
##   19                     0      0       0     130       0       0       0        0    0
##   20                     0      0       0     149       0       0       0        0    0
##   21                     0      0       0       0     158       0       0        0    0
##   22                     0      0       0       0     149       0       0        0    0
##   23                     0      0       0       0     125       0       0        0    0
##   24                     0      0       0       0     144       0       0        0    0
##   25                     0      0       0       0     107       0       0        0    0
##   26                     0      0       0       0     100       0       0        0    0
##   27                     0      0       0       0     117       0       0        0    0
##   28                     0      0       0       0      85       0       0        0    0
##   29                     0      0       0       0      82       0       0        0    0
##   30                     0      0       0       0      82       0       0        0    0
##   31                     0      0       0       0       0      68       0        0    0
##   32                     0      0       0       0       0      84       0        0    0
##   33                     0      0       0       0       0      78       0        0    0
##   34                     0      0       0       0       0      58       0        0    0
##   35                     0      0       0       0       0      58       0        0    0
##   36                     0      0       0       0       0      33       0        0    0
##   37                     0      0       0       0       0      46       0        0    0
##   38                     0      0       0       0       0      45       0        0    0
##   39                     0      0       0       0       0      45       0        0    0
##   40                     0      0       0       0       0      32       0        0    0
##   41                     0      0       0       0       0      34       0        0    0
##   42                     0      0       0       0       0      26       0        0    0
##   43                     0      0       0       0       0      31       0        0    0
##   44                     0      0       0       0       0      24       0        0    0
##   45                     0      0       0       0       0      27       0        0    0
##   46                     0      0       0       0       0      25       0        0    0
##   47                     0      0       0       0       0      16       0        0    0
##   48                     0      0       0       0       0      21       0        0    0
##   49                     0      0       0       0       0      15       0        0    0
##   50                     0      0       0       0       0      12       0        0    0
##   51                     0      0       0       0       0       0      13        0    0
##   52                     0      0       0       0       0       0       7        0    0
##   53                     0      0       0       0       0       0       4        0    0
##   54                     0      0       0       0       0       0       6        0    0
##   55                     0      0       0       0       0       0       9        0    0
##   56                     0      0       0       0       0       0       7        0    0
##   57                     0      0       0       0       0       0       9        0    0
##   58                     0      0       0       0       0       0       6        0    0
##   59                     0      0       0       0       0       0       5        0    0
##   60                     0      0       0       0       0       0       4        0    0
##   61                     0      0       0       0       0       0       2        0    0
##   62                     0      0       0       0       0       0       1        0    0
##   63                     0      0       0       0       0       0       5        0    0
##   64                     0      0       0       0       0       0       1        0    0
##   65                     0      0       0       0       0       0       5        0    0
##   66                     0      0       0       0       0       0       3        0    0
##   67                     0      0       0       0       0       0       2        0    0
##   68                     0      0       0       0       0       0       1        0    0
##   69                     0      0       0       0       0       0       3        0    0
##   70                     0      0       0       0       0       0       1        0    0
##   72                     0      0       0       0       0       0       0        1    0
##   73                     0      0       0       0       0       0       0        3    0
##   76                     0      0       0       0       0       0       0        1    0
##   84                     0      0       0       0       0       0       0        1    0
##   <NA>                   0      0       0       0       0       0       0        0  107

Re-labeling NA values

You may want to assign NA values a label such as “Missing”. Because the new column is class Factor (restricted values), you cannot simply mutate it with replace_na(), as this value will be rejected. Instead, use fct_explicit_na() from forcats as explained in the Factors page.

linelist <- linelist %>% 
  
  # cut() creates age_cat, automatically of class Factor      
  mutate(age_cat = cut(
    age_years,
    breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),          
    right = FALSE,
    include.lowest = TRUE,        
    labels = c("0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70-100")),
         
    # make missing values explicit
    age_cat = fct_explicit_na(
      age_cat,
      na_level = "Missing age")  # you can specify the label
  )    

# table to view counts
table(linelist$age_cat, useNA = "always")
## 
##         0-4         5-9       10-14       15-19       20-29       30-49       50-69      70-100 Missing age        <NA> 
##        1227        1223        1048         827        1216         848         105           7         107           0
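The same idea can be sketched in base R, without forcats: register the new level first, then assign it to the missing entries. The small factor below is hypothetical and made up for illustration. As far as we are aware, forcats 1.0.0 and later deprecate fct_explicit_na() in favour of fct_na_value_to_level(), so check which one your installed version prefers.

```r
# Hypothetical factor with one missing value (not taken from the linelist)
age_cat <- factor(c("0-4", "5-9", NA, "0-4"))

# Assigning a value that is not an existing level does not work:
# the entry stays NA and R gives a warning
# age_cat[is.na(age_cat)] <- "Missing age"

# Instead, register the new level first, then fill in the missing entries
levels(age_cat) <- c(levels(age_cat), "Missing age")
age_cat[is.na(age_cat)] <- "Missing age"

# table to view counts
table(age_cat, useNA = "always")
```

The new level is added at the end of the level order, which is usually where a “Missing” category belongs in tables and plots.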

Quickly make breaks and labels

For a fast way to make breaks and label vectors, use something like below. See the R basics page for references on seq() and rep().

# Make break points from 0 to 90 by 5
age_seq = seq(from = 0, to = 90, by = 5)
age_seq

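# Make labels for the above categories, assuming default cut() settings (right = TRUE)
# cut() needs one label per interval, i.e. one fewer label than break points
age_labels = paste0(head(age_seq, -1) + 1, "-", age_seq[-1])
age_labels

# check that there is one fewer label than break points
length(age_labels) == length(age_seq) - 1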
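As a rough sketch of how such vectors plug into cut(), the example below builds break points and labels and applies them to a few hypothetical ages (the ages vector and the object names are invented for illustration). Note that cut() expects one label per interval, which is one fewer than the number of break points. Here the labels are written for right = FALSE, so that the interval from 0 up to (but not including) 5 is labelled “0-4”.

```r
# Hypothetical ages for illustration (not from the linelist)
ages <- c(0, 3, 5, 12, 47, 89, NA)

# Break points from 0 to 90 by 5 (19 values define 18 intervals)
age_breaks <- seq(from = 0, to = 90, by = 5)

# One label per interval, written for right = FALSE: "0-4", "5-9", ...
lower <- head(age_breaks, -1)            # drop the last break point
age_labels <- paste0(lower, "-", lower + 4)

# Apply the breaks and labels
age_cat <- cut(ages, breaks = age_breaks, right = FALSE, labels = age_labels)

# table to view counts
table(age_cat, useNA = "always")
```

If the number of labels does not match the number of intervals, cut() stops with an error, so a quick length check before calling it can save time.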

Read more about cut() in its Help page by entering ?cut in the R console.

Quantile breaks

In common understanding, “quantiles” or “percentiles” typically refer to a value below which a proportion of values fall. For example, the 95th percentile of ages in linelist would be the age below which 95% of the ages fall.

However, in common speech, “quartiles” and “deciles” can also refer to the groups of data as equally divided into 4 or 10 groups (note there will be one more break point than group).

To get quantile break points, you can use quantile() from the stats package from base R. You provide a numeric vector (e.g. a column in a dataset) and vector of numeric probability values ranging from 0 to 1.0. The break points are returned as a numeric vector. Explore the details of the statistical methodologies by entering ?quantile.

  • If your input numeric vector has any missing values it is best to set na.rm = TRUE
  • Set names = FALSE to get an un-named numeric vector

quantile(linelist$age_years,               # specify numeric vector to work on
  probs = c(0, .25, .50, .75, .90, .95),   # specify the percentiles you want
  na.rm = TRUE)                            # ignore missing values 
##  0% 25% 50% 75% 90% 95% 
##   0   6  13  23  33  41
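To see the effect of the two arguments mentioned above, here is a minimal sketch on a short hypothetical vector (the values are invented so the result is easy to check by eye).

```r
# Hypothetical ages with one missing value
ages <- c(0, 10, 20, 30, 40, NA)

# na.rm = TRUE is needed here: quantile() stops with an error if NA is present
# names = FALSE drops the "0%", "50%", ... names from the result
age_breaks <- quantile(ages, probs = c(0, 0.5, 1), na.rm = TRUE, names = FALSE)
age_breaks
```

An un-named vector like this is convenient to pass directly to the breaks argument of cut().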

You can use the results of quantile() as break points in age_categories() or cut(). Below we create a new column deciles using cut(), where the breaks are defined using quantile() on age_years. We display the results using tabyl() from janitor so you can see the percentages (see the Descriptive tables page). Note that the groups do not each contain exactly 10% of the rows, because many rows share the same age value and tied values must fall in the same group.

linelist %>%                                # begin with linelist
  mutate(deciles = cut(age_years,           # create new column deciles as cut() on column age_years
    breaks = quantile(                      # define cut breaks using quantile()
      age_years,                               # operate on age_years
      probs = seq(0, 1, by = 0.1),             # 0.0 to 1.0 by 0.1
      na.rm = TRUE),                           # ignore missing values
    include.lowest = TRUE)) %>%             # for cut() include age 0
  janitor::tabyl(deciles)                   # pipe to table to display
##  deciles   n    percent valid_percent
##    [0,2] 748 0.11319613    0.11505922
##    (2,5] 721 0.10911017    0.11090601
##    (5,7] 497 0.07521186    0.07644978
##   (7,10] 698 0.10562954    0.10736810
##  (10,13] 635 0.09609564    0.09767728
##  (13,17] 755 0.11425545    0.11613598
##  (17,21] 578 0.08746973    0.08890940
##  (21,26] 625 0.09458232    0.09613906
##  (26,33] 596 0.09019370    0.09167820
##  (33,84] 648 0.09806295    0.09967697
##     <NA> 107 0.01619249            NA

Evenly-sized groups

Another tool to make numeric groups is the dplyr function ntile(), which attempts to break your data into n evenly-sized groups - but be aware that, unlike with quantile(), the same value could appear in more than one group. Provide the numeric vector and then the number of groups. The values in the new column are just group “numbers” (e.g. 1 to 10), not the ranges of values themselves as when using cut().

# make groups with ntile()
ntile_data <- linelist %>% 
  mutate(even_groups = ntile(age_years, 10))

# make table of counts and proportions by group
ntile_table <- ntile_data %>% 
  janitor::tabyl(even_groups)
  
# attach min/max values to demonstrate ranges
ntile_ranges <- ntile_data %>% 
  group_by(even_groups) %>% 
  summarise(
    min = min(age_years, na.rm=T),
    max = max(age_years, na.rm=T)
  )
## Warning in min(age_years, na.rm = T): no non-missing arguments to min; returning Inf
## Warning in max(age_years, na.rm = T): no non-missing arguments to max; returning -Inf

# combine and print - note that values are present in multiple groups
left_join(ntile_table, ntile_ranges, by = "even_groups")
##  even_groups   n    percent valid_percent min  max
##            1 651 0.09851695    0.10013844   0    2
##            2 650 0.09836562    0.09998462   2    5
##            3 650 0.09836562    0.09998462   5    7
##            4 650 0.09836562    0.09998462   7   10
##            5 650 0.09836562    0.09998462  10   13
##            6 650 0.09836562    0.09998462  13   17
##            7 650 0.09836562    0.09998462  17   21
##            8 650 0.09836562    0.09998462  21   26
##            9 650 0.09836562    0.09998462  26   33
##           10 650 0.09836562    0.09998462  33   84
##           NA 107 0.01619249            NA Inf -Inf
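The overlap between groups is easier to see on a tiny example. The sketch below uses a hypothetical vector in which one value is repeated more often than a single group can hold.

```r
library(dplyr)

# Hypothetical ages in which the value 1 occurs four times
ages <- c(1, 1, 1, 1, 2, 3)

# Three groups of two values each: the repeated 1s cannot all fit in group 1
groups <- ntile(ages, 3)
groups

# cross-tabulate values against groups to see the overlap
table(age = ages, group = groups)
```

Here the value 1 lands in both group 1 and group 2. If you need every tied value to stay in the same group, breaks from quantile() passed to cut() may be the safer choice.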

case_when()

It is possible to use the dplyr function case_when() to create categories from a numeric column, but it is easier to use age_categories() from epikit or cut() because these will automatically create a factor whose levels are in the correct order.

If using case_when(), please review the proper use as described earlier in the Re-code values section of this page. Also be aware that all right-hand side values must be of the same class. Thus, if you want NA on the right-side you should either write “Missing” or use the special NA value NA_character_.
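As a minimal sketch of the case_when() approach, the example below categorises a few hypothetical ages (the cut-offs and labels are chosen only for illustration). Because case_when() returns plain character values, the level order is set afterwards with factor().

```r
library(dplyr)

# Hypothetical ages, including one missing value
ages <- c(2, 15, 40, NA)

age_group <- case_when(
  ages < 5   ~ "0-4",
  ages < 18  ~ "5-17",
  ages >= 18 ~ "18+",
  TRUE       ~ NA_character_   # character NA, the same class as the other right-hand sides
)

# case_when() returns character values, so set the level order explicitly
age_group <- factor(age_group, levels = c("0-4", "5-17", "18+"))
levels(age_group)
```

The conditions are evaluated in order, so an age of 2 is assigned “0-4” even though it also satisfies the second condition.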

Add to pipe chain

Below, code to create two categorical age columns is added to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 

    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
  
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################   
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))

8.10 Add rows

One-by-one

Adding rows one-by-one manually is tedious but can be done with add_row() from dplyr. Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.

linelist <- linelist %>% 
  add_row(row_num = 666,
          case_id = "abc",
          generation = 4,
          `infection date` = as.Date("2020-10-10"),
          .before = 2)

Use .before and .after to specify the placement of the row you want to add. .before = 3 will put the new row before the current 3rd row. The default behavior is to add the row to the end. Columns not specified will be left empty (NA).

The new row's number may look strange (“…23”), and the row numbers of the pre-existing rows will have shifted. So if you use the command twice, examine/test the insertion carefully.

If a class you provide is off you will see an error like this:

Error: Can't combine ..1$infection date <date> and ..2$infection date <character>.

When inserting a row with a date value, remember to wrap the date in the function as.Date(), for example as.Date("2020-10-10").
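Here is a small self-contained sketch of add_row() on a hypothetical data frame (the object and column names are invented for illustration). As far as we know, add_row() is defined in tibble and re-exported by dplyr, so loading dplyr is enough.

```r
library(dplyr)

# Hypothetical data frame with three columns
cases <- tibble(
  case_id    = c("a1", "b2"),
  generation = c(1, 2),
  date_onset = as.Date(c("2020-01-01", "2020-01-05"))
)

# Insert a row before the current 2nd row
# generation is not specified, so it is left as NA
cases_added <- cases %>%
  add_row(case_id    = "abc",
          date_onset = as.Date("2020-10-10"),   # class Date, matching the column
          .before    = 2)

cases_added
```

Supplying "2020-10-10" as plain text instead of wrapping it in as.Date() would trigger the class error described above.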

Bind rows

To combine datasets by binding the rows of one data frame to the bottom of another data frame, you can use bind_rows() from dplyr. This is explained in more detail in the page Joining data.

8.11 Filter rows

A typical cleaning step after you have cleaned the columns and re-coded values is to filter the data frame for specific rows using the dplyr verb filter().

Within filter(), specify the logic that must be TRUE for a row in the dataset to be kept. Below we show how to filter rows based on simple and complex logical conditions.

Simple filter

This simple example re-defines the dataframe linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses evaluates to TRUE are kept.

In this example, the logical statement is gender == "f", which is asking whether the value in the column gender is equal to “f” (case sensitive).

Before the filter is applied, the number of rows in linelist is nrow(linelist).

linelist <- linelist %>% 
  filter(gender == "f")   # keep only rows where gender is equal to "f"

After the filter is applied, the number of rows in linelist is linelist %>% filter(gender == "f") %>% nrow().

Filter out missing values

It is fairly common to want to filter out rows that have missing values. Resist the urge to write filter(!is.na(column) & !is.na(column)) and instead use the tidyr function that is custom-built for this purpose: drop_na(). If run with empty parentheses, it removes rows with any missing values. Alternatively, you can provide names of specific columns to be evaluated for missingness, or use the “tidyselect” helper functions described above.

linelist %>% 
  drop_na(case_id, age_years)  # drop rows with missing values for case_id or age_years

See the page on Missing data for many techniques to analyse and manage missingness in your data.
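The difference between empty parentheses and named columns can be sketched on a small hypothetical data frame (values invented for illustration):

```r
library(tidyr)

# Hypothetical data with missing values in different columns
cases <- data.frame(
  case_id   = c("a1", NA, "c3", "d4"),
  age_years = c(10, 25, NA, 40),
  outcome   = c("Death", "Recover", "Recover", NA)
)

# empty parentheses: drop rows with a missing value in ANY column
complete_rows <- drop_na(cases)                   # keeps only case a1

# named columns: only missingness in these columns matters
id_and_age <- drop_na(cases, case_id, age_years)  # keeps cases a1 and d4

complete_rows
id_and_age
```

Case d4 survives the second call because its only missing value is in outcome, which was not listed.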

Filter by row number

In a data frame or tibble, each row will usually have a “row number” that (when seen in R Viewer) appears to the left of the first column. It is not itself a true column in the data, but it can be used in a filter() statement.

To filter based on “row number”, you can use the dplyr function row_number() with empty parentheses as part of a logical filtering statement. Often you will use the %in% operator and a range of numbers as part of that logical statement, as shown below. To see the first N rows, you can also use the base R function head().

# View first 100 rows
linelist %>% head(100)     # or use tail() to see the n last rows

# Show row 5 only
linelist %>% filter(row_number() == 5)

# View rows 2 through 20, and three specific columns
linelist %>% filter(row_number() %in% 2:20) %>% select(date_onset, outcome, age)

You can also convert the row numbers to a true column by piping your data frame to the tibble function rownames_to_column() (do not put anything in the parentheses).

Complex filter

More complex logical statements can be constructed using parentheses ( ), OR |, negate !, %in%, and AND & operators. An example is below:

Note: You can use the ! operator in front of a logical criterion to negate it. For example, !is.na(column) evaluates to TRUE if the column value is not missing. Likewise !column %in% c("a", "b", "c") evaluates to TRUE if the column value is not in the vector.
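These operators can be tried on a plain vector before using them inside filter(). The sketch below uses a hypothetical vector of hospital names; the object names are invented for illustration.

```r
# Hypothetical hospital values, including one missing value
hospital <- c("Hospital A", "Port Hospital", NA, "Other")

# TRUE where the value is NOT one of the listed hospitals
# %in% never returns NA, so the missing value counts as "not in the vector"
not_ab <- !hospital %in% c("Hospital A", "Hospital B")
not_ab

# combine conditions with & (AND), | (OR), and parentheses
keep <- not_ab & (is.na(hospital) | hospital != "Other")
hospital[keep]
```

The final line returns “Port Hospital” and the missing value: both are outside Hospitals A and B, and neither is “Other”.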

Examine the data

Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this raw dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

How filters handle missing numeric and date values

Can we just filter by date_onset to rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date or number can remove any rows with missing values (NA)! This is because a comparison with NA returns NA rather than TRUE or FALSE, and filter() keeps only the rows where the condition is TRUE.
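A minimal sketch of this behaviour on three hypothetical rows: comparing NA with a date returns NA, and filter() keeps only rows where the condition is TRUE, so the row with the missing date is dropped unless you ask for it explicitly.

```r
library(dplyr)

# Hypothetical cases: one early onset, one late onset, one missing onset
onset <- tibble(
  case       = 1:3,
  date_onset = as.Date(c("2012-11-01", "2014-05-13", NA))
)

# the comparison itself returns FALSE, TRUE, NA
onset$date_onset > as.Date("2013-06-01")

# filter() keeps only TRUE, so the missing-date row is silently dropped
later <- onset %>%
  filter(date_onset > as.Date("2013-06-01"))

# add an explicit is.na() condition to keep rows with missing dates
later_or_missing <- onset %>%
  filter(date_onset > as.Date("2013-06-01") | is.na(date_onset))

later
later_or_missing
```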

(See the page on Working with dates for more information on working with dates and the package lubridate)

Design the filter

Examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2012 2013 2014 2015 <NA>
##   Central Hospital                        0    0  351   99   18
##   Hospital A                            229   46    0    0   15
##   Hospital B                            227   47    0    0   15
##   Military Hospital                       0    0  676  200   34
##   Missing                                 0    0 1117  318   77
##   Other                                   0    0  684  177   46
##   Port Hospital                           9    1 1372  347   75
##   St. Mark's Maternity Hospital (SMMH)    0    0  322   93   13
##   <NA>                                    0    0    0    0    0

What other criteria can we filter on to remove the first outbreak (in 2012 & 2013) from the dataset? We see that:

  • The first epidemic in 2012 & 2013 occurred at Hospital A, Hospital B, and that there were also 10 cases at Port Hospital.
  • Hospitals A & B did not have cases in the second epidemic, but Port Hospital did.

We want to exclude:

  • The nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) rows with onset in 2012 and 2013 at either hospital A, B, or Port:
    • Exclude nrow(linelist %>% filter(date_onset < as.Date("2013-06-01"))) rows with onset in 2012 and 2013
    • Exclude nrow(linelist %>% filter(hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset))) rows from Hospitals A & B with missing onset dates
    • Do not exclude nrow(linelist %>% filter(!hospital %in% c('Hospital A', 'Hospital B') & is.na(date_onset))) other rows with missing onset dates.

We start with a linelist of nrow(linelist) rows. Here is our filter statement:

linelist <- linelist %>% 
  # keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 6019

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, and the 10 Port Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of date_onset
      useNA     = "always")                              # show missing values
##                                       YearOnset
## Hospital                               2014 2015 <NA>
##   Central Hospital                      351   99   18
##   Military Hospital                     676  200   34
##   Missing                              1117  318   77
##   Other                                 684  177   46
##   Port Hospital                        1372  347   75
##   St. Mark's Maternity Hospital (SMMH)  322   93   13
##   <NA>                                    0    0    0

Multiple statements can be included within one filter command, separated by commas (a row is kept only if it satisfies all of them, as if they were joined by &), or you can always pipe to a separate filter() command for clarity.

Note: some readers may notice that it would be easier to just filter by date_hospitalisation because it is 100% complete with no missing values. This is true. But date_onset is used for purposes of demonstrating a complex filter.

Standalone

Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(linelist$case_id), ]   # in base R the column must be referenced with $

Quickly review records

Often you want to quickly review a few records, for only a few columns. The base R function View() will print a data frame for viewing in your RStudio.

View the linelist in RStudio:

View(linelist)

Here are two examples of viewing specific cells (specific rows, and specific columns):

With dplyr functions filter() and select():

Within View(), pipe the dataset to filter() to keep certain rows, and then to select() to keep certain columns. For example, to review onset and hospitalization dates of 3 specific cases:

View(linelist %>%
       filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>%
       select(date_onset, date_hospitalisation))

You can achieve the same with base R syntax, using brackets [ ] to subset the rows and columns you want to see.

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Add to pipe chain

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################

# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_header, x28)) %>% 
  
    # de-duplicate
    distinct() %>% 

    # add column
    mutate(bmi = wt_kg / (ht_cm/100)^2) %>%     

    # convert class of columns
    mutate(across(contains("date"), as.Date), 
           generation = as.numeric(generation),
           age        = as.numeric(age)) %>% 
    
    # add column: delay to hospitalisation
    mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>% 
    
    # clean values of hospital column
    mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Port Hopital"      = "Port Hospital",
                      "Central Hopital"   = "Central Hospital",
                      "other"             = "Other",
                      "St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
                      )) %>% 
    
    mutate(hospital = replace_na(hospital, "Missing")) %>% 

    # create age_years column (from age and age_unit)
    mutate(age_years = case_when(
          age_unit == "years" ~ age,
          age_unit == "months" ~ age/12,
          is.na(age_unit) ~ age,
          TRUE ~ NA_real_)) %>% 
  
    mutate(
          # age categories: custom
          age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
        
          # age categories: 0 to 85 by 5s
          age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5))) %>% 
    
    # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(
          # keep only rows where case_id is not missing
          !is.na(case_id),  
          
          # also filter to keep only the second outbreak
          date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

8.12 Row-wise calculations

If you want to perform a calculation within a row, you can use rowwise() from dplyr. See this online vignette on row-wise calculations.

For example, this code applies rowwise() and then creates a new column that counts, for each row in the linelist, how many of the specified symptom columns have the value “yes”. The columns are specified within sum() by name within a vector c(). rowwise() is essentially a special kind of group_by(), so it is best to use ungroup() when you are done (page on Grouping data).

linelist %>%
  rowwise() %>%
  mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes")) %>% 
  ungroup() %>% 
  select(fever, chills, cough, aches, vomit, num_symptoms) # for display
## # A tibble: 5,888 x 6
##    fever chills cough aches vomit num_symptoms
##    <chr> <chr>  <chr> <chr> <chr>        <int>
##  1 no    no     yes   no    yes              2
##  2 <NA>  <NA>   <NA>  <NA>  <NA>            NA
##  3 <NA>  <NA>   <NA>  <NA>  <NA>            NA
##  4 no    no     no    no    no               0
##  5 no    no     yes   no    yes              2
##  6 no    no     yes   no    yes              2
##  7 <NA>  <NA>   <NA>  <NA>  <NA>            NA
##  8 no    no     yes   no    yes              2
##  9 no    no     yes   no    yes              2
## 10 no    no     yes   no    no               1
## # ... with 5,878 more rows
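For simple counts like this, a vectorised base R alternative to rowwise() is rowSums(). This is a different technique from the one shown above, sketched here on a hypothetical data frame with invented values. It also shows one way to handle missing answers: with na.rm = TRUE they are treated as “not yes” instead of making the whole row NA.

```r
# Hypothetical symptom columns, with some missing values
symptoms <- data.frame(
  fever  = c("yes", "no",  NA),
  chills = c("yes", "no",  NA),
  cough  = c("no",  "yes", "yes")
)

# comparing a data frame with "yes" gives a logical matrix
# rowSums() then counts the TRUE values in each row
symptoms$num_symptoms <- rowSums(symptoms == "yes", na.rm = TRUE)
symptoms
```

Whether a missing answer should count as zero or make the total NA is an analytic decision, so choose na.rm deliberately.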

As you specify the column to evaluate, you may want to use the “tidyselect” helper functions described in the select() section of this page. You just have to make one adjustment (because you are not using them within a dplyr function like select() or summarise()).

Put the column-specification criteria within the dplyr function c_across(). This is because c_across (documentation) is designed to work with rowwise() specifically. For example, the following code:

  • Applies rowwise() so the following operation (sum()) is applied within each row (not summing entire columns)
  • Creates new column num_NA_dates, defined for each row as the number of columns (with name containing “date”) for which is.na() evaluated to TRUE (they are missing data).
  • ungroup() to remove the effects of rowwise() for subsequent steps

linelist %>%
  rowwise() %>%
  mutate(num_NA_dates = sum(is.na(c_across(contains("date"))))) %>% 
  ungroup() %>% 
  select(num_NA_dates, contains("date")) # for display
## # A tibble: 5,888 x 5
##    num_NA_dates date_infection date_onset date_hospitalisation date_outcome
##           <int> <date>         <date>     <date>               <date>      
##  1            1 2014-05-08     2014-05-13 2014-05-15           NA          
##  2            1 NA             2014-05-13 2014-05-14           2014-05-18  
##  3            1 NA             2014-05-16 2014-05-18           2014-05-30  
##  4            1 2014-05-04     2014-05-18 2014-05-20           NA          
##  5            0 2014-05-18     2014-05-21 2014-05-22           2014-05-29  
##  6            0 2014-05-03     2014-05-22 2014-05-23           2014-05-24  
##  7            0 2014-05-22     2014-05-27 2014-05-29           2014-06-01  
##  8            0 2014-05-28     2014-06-02 2014-06-03           2014-06-07  
##  9            1 NA             2014-06-05 2014-06-06           2014-06-18  
## 10            1 NA             2014-06-05 2014-06-07           2014-06-09  
## # ... with 5,878 more rows

You could also provide other functions, such as max() to get the latest or most recent date for each row:

linelist %>%
  rowwise() %>%
  mutate(latest_date = max(c_across(contains("date")), na.rm=T)) %>% 
  ungroup() %>% 
  select(latest_date, contains("date"))  # for display
## # A tibble: 5,888 x 5
##    latest_date date_infection date_onset date_hospitalisation date_outcome
##    <date>      <date>         <date>     <date>               <date>      
##  1 2014-05-15  2014-05-08     2014-05-13 2014-05-15           NA          
##  2 2014-05-18  NA             2014-05-13 2014-05-14           2014-05-18  
##  3 2014-05-30  NA             2014-05-16 2014-05-18           2014-05-30  
##  4 2014-05-20  2014-05-04     2014-05-18 2014-05-20           NA          
##  5 2014-05-29  2014-05-18     2014-05-21 2014-05-22           2014-05-29  
##  6 2014-05-24  2014-05-03     2014-05-22 2014-05-23           2014-05-24  
##  7 2014-06-01  2014-05-22     2014-05-27 2014-05-29           2014-06-01  
##  8 2014-06-07  2014-05-28     2014-06-02 2014-06-03           2014-06-07  
##  9 2014-06-18  NA             2014-06-05 2014-06-06           2014-06-18  
## 10 2014-06-09  NA             2014-06-05 2014-06-07           2014-06-09  
## # ... with 5,878 more rows

8.13 Arrange and sort

Use the dplyr function arrange() to sort or order the rows by column values.

Simply list the columns in the order they should be sorted on. Specify .by_group = TRUE if you want the sorting to first occur by any groupings applied to the data (see page on Grouping data).

By default, columns will be sorted in “ascending” order (which applies to numeric and also to character columns). You can sort a variable in “descending” order by wrapping it with desc().

Sorting data with arrange() is particularly useful when making Tables for presentation, using slice() to take the “top” rows per group, or setting factor level order by order of appearance.

For example, to sort our linelist rows by hospital, then by date_onset in descending order, we would use:

linelist %>% 
   arrange(hospital, desc(date_onset))
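To see what this ordering looks like, here is the same command sketched on four hypothetical rows (the hospital names and dates are invented for illustration).

```r
library(dplyr)

# Hypothetical rows, including one missing onset date
cases <- tibble(
  hospital   = c("Port", "Central", "Port", "Central"),
  date_onset = as.Date(c("2014-05-01", "2014-06-01", "2014-07-01", NA))
)

# hospital A-Z, then most recent onset first within each hospital
sorted_cases <- cases %>%
  arrange(hospital, desc(date_onset))

sorted_cases
```

Note that arrange() places missing values at the end of the sort, even when the column is wrapped in desc(), so the Central row with no onset date appears after the dated Central row.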

9 Working with dates

Working with dates in R requires more attention than working with other object classes. Below, we offer some tools and examples to make this process less painful. Luckily, dates can be wrangled easily with practice, and with a set of helpful packages such as lubridate.

Upon import of raw data, R often interprets dates as character objects - this means they cannot be used for general date operations such as making time series and calculating time intervals. To make matters more difficult, there are many ways a date can be formatted and you must help R know which part of a date represents what (month, day, hour, etc.).

Dates in R are their own class of object - the Date class. It should be noted that there is also a class that stores objects with date and time. Date time objects are formally referred to as POSIXt, POSIXct, and/or POSIXlt classes (the difference isn’t important). These objects are informally referred to as datetime classes.

  • It is important to make R recognize when a column contains dates.
  • Dates are an object class and can be tricky to work with.
  • Here we present several ways to convert date columns to Date class.

9.1 Preparation

Load packages

This code chunk shows the loading of packages required for this page. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

# Checks if package is installed, installs if necessary, and loads package for current session

pacman::p_load(
  lubridate,  # general package for handling and converting dates  
  linelist,   # has function to "guess" messy dates
  aweek,      # another option for converting dates to weeks, and weeks to dates
  zoo,        # additional date/time functions
  tidyverse,  # data management and visualization  
  rio)        # data import/export

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow along step-by-step, see the instructions in the Download handbook and data page. We assume the file is in the working directory, so no sub-folders are specified in this file path.

linelist <- import("linelist_cleaned.xlsx")

9.2 Current date

You can get the current “system” date or system datetime of your computer by doing the following with base R.

# get the system date - this is a DATE class
Sys.Date()
## [1] "2021-08-31"
# get the system time - this is a DATETIME class
Sys.time()
## [1] "2021-08-31 19:02:39 EDT"

With the lubridate package these can also be returned with today() and now(), respectively. date() returns the current date and time with weekday and month names.
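A brief sketch of the lubridate equivalents, assuming lubridate is installed. The object names are illustrative only, and no printed values are shown because they depend on when and where the code is run.

```r
library(lubridate)

current_date     <- today()   # lubridate counterpart of Sys.Date()
current_datetime <- now()     # lubridate counterpart of Sys.time()

class(current_date)       # "Date"
class(current_datetime)   # "POSIXct" "POSIXt"
```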

9.3 Convert to Date

After importing a dataset into R, date column values may look like “1989/12/30”, “05/06/2014”, or “13 Jan 2020”. In these cases, R is likely still treating these values as Character values. R must be told that these values are dates… and what the format of the date is (which part is Day, which is Month, which is Year, etc).

Once told, R converts these values to class Date. In the background, R will store the dates as numbers (the number of days from its “origin” date 1 Jan 1970). You will not interface with the date number often, but this allows for R to treat dates as continuous variables and to allow special operations such as calculating the distance between dates.
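The numeric storage described above can be checked directly in base R. This is a small illustration using made-up dates (the object names are not part of the handbook data).

```r
# Dates are stored as the number of days since 1970-01-01
origin_date <- as.Date("1970-01-01")
later_date  <- as.Date("1970-01-10")

as.numeric(origin_date)   # 0
as.numeric(later_date)    # 9

# Because dates are numbers underneath, they can be compared and subtracted
later_date > origin_date                 # TRUE
days_apart <- as.numeric(later_date - origin_date)
days_apart                               # 9
```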

By default, values of class Date in R are displayed as YYYY-MM-DD. Later in this section we will discuss how to change the display of date values.

Below we present two approaches to converting a column from character values to class Date.

TIP: You can check the current class of a column with base R function class(), like class(linelist$date_onset).

base R

as.Date() is the standard, base R function to convert an object or column to class Date (note capitalization of “D”).

Use of as.Date() requires that:

  • You specify the existing format of the raw character date or the origin date if supplying dates as numbers (see section on Excel dates)
  • If used on a character column, all date values must have the same exact format (if this is not the case, try guess_dates() from the linelist package)

First, check the class of your column with class() from base R. If you are unsure or confused about the class of your data (e.g. you see “POSIXct”, etc.) it can be easiest to first convert the column to class Character with as.character(), and then convert it to class Date.

Second, within the as.Date() function, use the format = argument to tell R the current format of the character date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (“YYYY-MM-DD” or “YYYY/MM/DD”) the format = argument is not necessary.

To format =, provide a character string (in quotes) that represents the current date format using the special “strptime” abbreviations below. For example, if your character dates are currently in the format “DD/MM/YYYY”, like “24/04/1968”, then you would use format = "%d/%m/%Y" to convert the values into dates. Putting the format in quotation marks is necessary. And don’t forget any slashes or dashes!

# Convert to class date
linelist <- linelist %>% 
  mutate(date_onset = as.Date(date_of_onset, format = "%d/%m/%Y"))

Most of the strptime abbreviations are listed below. You can see the complete list by running ?strptime.

%d = Day number of month (5, 17, 28, etc.)
%j = Day number of the year (Julian day 001-366)
%a = Abbreviated weekday (Mon, Tue, Wed, etc.)
%A = Full weekday (Monday, Tuesday, etc.)
%w = Weekday number (0-6, Sunday is 0)
%u = Weekday number (1-7, Monday is 1)
%W = Week number (00-53, Monday is week start)
%U = Week number (00-53, Sunday is week start)
%m = Month number (e.g. 01, 02, 03, 04)
%b = Abbreviated month (Jan, Feb, etc.)
%B = Full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = Hours (24-hr clock)
%M = Minutes
%S = Seconds
%z = Offset from GMT
%Z = Time zone (character)

TIP: The format = argument of as.Date() is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.

TIP: Be sure that in the format = argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.
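To make the two tips above concrete, here is a minimal base R sketch with made-up date strings. It shows the format string describing the input layout, and what happens when the separator in the format does not match the data.

```r
# The format string describes how the raw text is laid out
d1 <- as.Date("24/04/1968", format = "%d/%m/%Y")
d2 <- as.Date("04-24-1968", format = "%m-%d-%Y")
d3 <- as.Date("1968 04 24", format = "%Y %m %d")

# All three print in the standard display format: "1968-04-24"
d1; d2; d3

# Values already in a standard format need no format = argument
d4 <- as.Date("1968-04-24")

# A separator mismatch (dashes in the format, slashes in the data) returns NA
d_bad <- as.Date("24/04/1968", format = "%d-%m-%Y")
d_bad   # NA
```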

Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.

lubridate

Converting character objects to dates can be made easier by using the lubridate package. This is a tidyverse package designed to make working with dates and times more simple and consistent than in base R. For these reasons, lubridate is often considered the gold-standard package for dates and time, and is recommended whenever working with them.

The lubridate package provides several different helper functions designed to convert character objects to dates in an intuitive, and more lenient way than specifying the format in as.Date(). These functions are specific to the rough date format, but allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January) - they are named after abbreviations of date formats.

# install/load lubridate 
pacman::p_load(lubridate)

The ymd() function flexibly converts date values supplied as year, then month, then day.

# read date in year-month-day format
ymd("2020-10-11")
## [1] "2020-10-11"
ymd("20201011")
## [1] "2020-10-11"

The mdy() function flexibly converts date values supplied as month, then day, then year.

# read date in month-day-year format
mdy("10/11/2020")
## [1] "2020-10-11"
mdy("Oct 11 20")
## [1] "2020-10-11"

The dmy() function flexibly converts date values supplied as day, then month, then year.

# read date in day-month-year format
dmy("11 10 2020")
## [1] "2020-10-11"
dmy("11 October 2020")
## [1] "2020-10-11"

If using piping, the conversion of a character column to dates with lubridate might look like this:

linelist <- linelist %>%
  mutate(date_onset = lubridate::dmy(date_onset))

Once complete, you can run class() to verify the class of the column.

# Check the class of the column
class(linelist$date_onset)  

Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.

Note that the above functions work best with 4-digit years. 2-digit years can produce unexpected results, as lubridate attempts to guess the century.

To convert a 2-digit year into a 4-digit year (all in the same century) you can convert to class character and then combine the existing digits with a prefix using str_glue() from the stringr package (see Characters and strings). Then convert to date.

two_digit_years <- c("15", "15", "16", "17")
str_glue("20{two_digit_years}")
## 2015
## 2015
## 2016
## 2017
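The same idea can be sketched on whole date strings in base R. The example below uses made-up values and assumes every year belongs to the 1900s. It first shows base R's own rule for %y (00-68 become 20xx, 69-99 become 19xx), then inserts the century prefix with sub() before parsing with %Y.

```r
raw_dates <- c("24/04/68", "05/06/14")

# base R rule for 2-digit years: 00-68 -> 20xx, 69-99 -> 19xx
guessed <- as.Date(raw_dates, format = "%d/%m/%y")
guessed   # "2068-04-24" "2014-06-05"

# If all years are known to be in the 1900s, add the prefix, then parse with %Y
with_century <- sub("(\\d{2})$", "19\\1", raw_dates)   # "24/04/1968" "05/06/1914"
fixed <- as.Date(with_century, format = "%d/%m/%Y")
fixed     # "1968-04-24" "1914-06-05"
```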

Combine columns

You can use the lubridate functions make_date() and make_datetime() to combine multiple numeric columns into one date column. For example if you have numeric columns onset_day, onset_month, and onset_year in the data frame linelist:

linelist <- linelist %>% 
  mutate(onset_date = make_date(year = onset_year, month = onset_month, day = onset_day))
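A self-contained sketch of the same functions on plain vectors, assuming lubridate is installed. The vectors below are invented for illustration and stand in for the numeric columns.

```r
library(lubridate)

onset_day   <- c(1, 15, 28)
onset_month <- c(3, 3, 4)
onset_year  <- c(2020, 2020, 2020)

# combine the numeric parts into one Date vector
onset_date <- make_date(year = onset_year, month = onset_month, day = onset_day)
onset_date   # "2020-03-01" "2020-03-15" "2020-04-28"

# make_datetime() also accepts hour, min, and sec (time zone defaults to UTC)
onset_datetime <- make_datetime(year = 2020, month = 3, day = 1, hour = 14, min = 45)
onset_datetime   # "2020-03-01 14:45:00 UTC"
```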

9.4 Excel dates

In the background, most software stores dates as numbers. R stores dates from an origin of 1st January, 1970. Thus, if you run as.numeric(as.Date("1970-01-01")) you will get 0.

Microsoft Excel stores dates with an origin of either December 30, 1899 (Windows) or January 1, 1904 (Mac), depending on your operating system. See this Microsoft guidance for more information.

Excel dates often import into R as these numeric values instead of as characters. If the dataset you imported from Excel shows dates as numbers or characters like “41369”… use as.Date() (or lubridate’s as_date() function) to convert, but instead of supplying a “format” as above, supply the Excel origin date to the argument origin =.

This will not work if the Excel date is stored in R as a character type, so be sure to ensure the number is class Numeric!

NOTE: You should provide the origin date in R’s default date format (“YYYY-MM-DD”).

# An example of providing the Excel 'origin date' when converting Excel number dates
data_cleaned <- data %>% 
  mutate(date_onset = as.numeric(date_onset)) %>%   # ensure class is numeric
  mutate(date_onset = as.Date(date_onset, origin = "1899-12-30")) # convert to date using Excel origin
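The conversion can be tried on a small vector in base R. This sketch uses the serial number mentioned above and the Windows Excel origin. The object names are illustrative.

```r
# Excel serial numbers stored as numeric
excel_serial <- c(41369, 41370)
converted <- as.Date(excel_serial, origin = "1899-12-30")
converted   # "2013-04-05" "2013-04-06"

# If the numbers arrived as character, convert to numeric first
excel_as_text <- c("41369", "41370")
converted_text <- as.Date(as.numeric(excel_as_text), origin = "1899-12-30")
identical(converted, converted_text)   # TRUE
```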

9.5 Messy dates

The function guess_dates() from the linelist package attempts to read a “messy” date column containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(). If guess_dates() is not yet available on CRAN for R 4.0.2, try installing it via pacman::p_load_gh("reconhub/linelist").

For example, guess_dates() would read a vector of the character dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them to class Date as: 2018-01-03, 1982-03-07, and 1985-08-20.

linelist::guess_dates(c("03 Jan 2018",
                        "07/03/1982",
                        "08/20/85"))
## [1] "2018-01-03" "1982-03-07" "1985-08-20"

Some optional arguments for guess_dates() that you might include are:

  • error_tolerance - The proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1 or 10%)
  • last_date - the last valid date (defaults to current date)
  • first_date - the first valid date. Defaults to fifty years before the last_date.

# An example using guess_dates() on the column date_onset
linelist <- linelist %>%                 # the dataset is called linelist
  mutate(
    date_onset = linelist::guess_dates(  # guess_dates() from the package "linelist"
      date_onset,
      error_tolerance = 0.1,
      first_date = "2016-01-01"
    )
  )

9.6 Working with date-time class

As previously mentioned, R also supports a datetime class - a column that contains date and time information. As with the Date class, these often need to be converted from character objects to datetime objects.

Convert dates with times

A standard datetime object is formatted with the date first, which is followed by a time component - for example 01 Jan 2020, 16:30. As with dates, there are many ways this can be formatted, and there are numerous levels of precision (hours, minutes, seconds) that can be supplied.

Luckily, lubridate helper functions also exist to help convert these strings to datetime objects. These functions are extensions of the date helper functions, with _h (only hours supplied), _hm (hours and minutes supplied), or _hms (hours, minutes, and seconds supplied) appended to the end (e.g. dmy_hms()). These can be used as shown:

Convert datetime with only hours to datetime object

ymd_h("2020-01-01 16hrs")
## [1] "2020-01-01 16:00:00 UTC"
ymd_h("2020-01-01 4PM")
## [1] "2020-01-01 16:00:00 UTC"

Convert datetime with hours and minutes to datetime object

dmy_hm("01 January 2020 16:20")
## [1] "2020-01-01 16:20:00 UTC"

Convert datetime with hours, minutes, and seconds to datetime object

dmy_hms("01 January 2020, 16:20:40")
## [1] "2020-01-01 16:20:40 UTC"

You can supply a time zone, but it is ignored. See the section later in this page on time zones.

dmy_hms("01 January 2020, 16:20:40 PST")
## [1] "2020-01-01 16:20:40 UTC"

When working with a data frame, time and date columns can be combined to create a datetime column using str_glue() from stringr package and an appropriate lubridate function. See the page on Characters and strings for details on stringr.

In this example, the linelist data frame has a column time_admission in the format “hours:minutes”. To convert this to a datetime we follow a few steps:

  1. Create a “clean” time of admission column with missing values filled in with the column median. We do this because lubridate won’t operate on missing values.
  2. Combine it with the column date_hospitalisation using str_glue().
  3. Use the function ymd_hm() to convert the combined string to a datetime.

# packages
pacman::p_load(tidyverse, lubridate, stringr)

# time_admission is a column in hours:minutes
linelist <- linelist %>%
  
  # when time of admission is not given, assign the median admission time
  mutate(
    time_admission_clean = ifelse(
      is.na(time_admission),                 # if time is missing
      median(time_admission, na.rm = TRUE),  # assign the median (ignoring missing values)
      time_admission                         # if not missing keep as is
    )
  ) %>%
  
  # use str_glue() to combine date and time columns to create one character column
  # and then use ymd_hm() to convert it to datetime
  mutate(
    date_time_of_admission = str_glue("{date_hospitalisation} {time_admission_clean}") %>% 
      ymd_hm()
  )

Convert times alone

If your data contain only a character time (hours and minutes), you can convert and manipulate them as times using strptime() from base R. For example, to get the difference between two of these times:

# raw character times
time1 <- "13:45" 
time2 <- "15:20"

# Times converted to a datetime class
time1_clean <- strptime(time1, format = "%H:%M")
time2_clean <- strptime(time2, format = "%H:%M")

# Difference is of class "difftime" by default, here converted to numeric hours 
as.numeric(time2_clean - time1_clean)   # difference in hours
## [1] 1.583333

Note however that without a date value provided, it assumes the date is today. To combine a string date and a string time together see how to use stringr in the section just above. Read more about strptime() here.
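When subtracting times, the units of the result are chosen automatically. A hedged alternative, not shown in the original example, is to request the units explicitly with base R difftime(). The tz = "UTC" argument is an assumption added here so that the result does not depend on the local time zone.

```r
# parse the two character times (today's date is filled in automatically)
time1_clean <- strptime("13:45", format = "%H:%M", tz = "UTC")
time2_clean <- strptime("15:20", format = "%H:%M", tz = "UTC")

# ask for the units you want, then convert to numeric
diff_mins  <- as.numeric(difftime(time2_clean, time1_clean, units = "mins"))
diff_hours <- as.numeric(difftime(time2_clean, time1_clean, units = "hours"))

diff_mins    # 95
diff_hours   # 1.583333
```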

To convert single-digit numbers to double-digits (e.g. to “pad” hours or minutes with leading zeros to achieve 2 digits), see this “Pad length” section of the Characters and strings page.

Extract time

You can extract elements of a time with hour(), minute(), or second() from lubridate.

Here is an example of extracting the hour, and then classifying by part of the day. We begin with the column time_admission, which is class Character in format “HH:MM”. First, strptime() is used as described above to convert the characters to datetime class. Then, the hour is extracted with hour(), returning a number from 0-23. Finally, a column time_period is created using logic with case_when() to classify rows into Morning/Afternoon/Evening/Night based on their hour of admission.

linelist <- linelist %>%
  mutate(hour_admit = hour(strptime(time_admission, format = "%H:%M"))) %>%
  mutate(time_period = case_when(
    hour_admit > 06 & hour_admit < 12 ~ "Morning",
    hour_admit >= 12 & hour_admit < 17 ~ "Afternoon",
    hour_admit >= 17 & hour_admit < 21 ~ "Evening",
    hour_admit >=21 | hour_admit <= 6 ~ "Night"))

To learn more about case_when() see the page on Cleaning data and core functions.

9.7 Working with dates

lubridate can also be used for a variety of other tasks, such as extracting aspects of a date/datetime, performing date arithmetic, or calculating date intervals.

Here we define a date to use for the examples:

# create object of class Date
example_date <- ymd("2020-03-01")

Extract date components

You can extract common aspects such as month, day, weekday:

month(example_date)  # month number
## [1] 3
day(example_date)    # day (number) of the month
## [1] 1
wday(example_date)   # day number of the week (1-7)
## [1] 1

You can also extract time components from a datetime object or column. This can be useful if you want to view the distribution of admission times.

example_datetime <- ymd_hm("2020-03-01 14:45")

hour(example_datetime)     # extract hour
minute(example_datetime)   # extract minute
second(example_datetime)   # extract second

There are several options to retrieve weeks. See the section on Epidemiological weeks below.

Note that if you are seeking to display a date a certain way (e.g. “Jan 2020” or “Thursday 20 March” or “Week 20, 1977”) you can do this more flexibly as described in the section on Date display.

Date math

You can add certain numbers of days or weeks using their respective function from lubridate.

# add 3 days to this date
example_date + days(3)
## [1] "2020-03-04"
# add 7 weeks and subtract two days from this date
example_date + weeks(7) - days(2)
## [1] "2020-04-17"

Date intervals

The difference between dates can be calculated by:

  1. Ensure both dates are of class Date
  2. Use subtraction to return the “difftime” difference between the two dates
  3. If necessary, convert the result to numeric class to perform subsequent mathematical calculations

Below the interval between two dates is calculated and displayed. You can find intervals by using the subtraction “minus” symbol on values that are class Date. Note, however that the class of the returned value is “difftime” as displayed below, and must be converted to numeric.

# find the interval between this date and Feb 20 2020 
output <- example_date - ymd("2020-02-20")
output    # print
## Time difference of 10 days
class(output)
## [1] "difftime"

To do subsequent operations on a “difftime”, convert it to numeric with as.numeric().
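A small base R sketch of that conversion, using the same two dates as above. The units = argument of as.numeric() for difftime objects is shown as an optional extra, and the object names are illustrative.

```r
onset <- as.Date("2020-02-20")
hosp  <- as.Date("2020-03-01")

delay <- hosp - onset
class(delay)                 # "difftime"

# convert to plain numbers for further calculations
delay_days  <- as.numeric(delay)                    # 10
delay_weeks <- as.numeric(delay, units = "weeks")   # 1.428571

delay_days / 2               # 5
```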

This can all be brought together to work with data - for example:

pacman::p_load(lubridate, tidyverse)   # load packages

linelist_delay <- linelist %>%
  
  # convert dates from character to Date class by specifying dmy format
  mutate(date_onset = dmy(date_onset),
         date_hospitalisation = dmy(date_hospitalisation)) %>%
  
  # keep only cases with onset in March
  filter(month(date_onset) == 3) %>%
    
  # find the difference in days between onset and hospitalisation
  mutate(days_onset_to_hosp = date_hospitalisation - date_onset)

In a data frame context, if either of the above dates is missing, the result for that row will be NA rather than a numeric value. When using this column for calculations, be sure to set the na.rm = argument to TRUE. For example:

# calculate the median number of days to hospitalisation for all cases where data are available
median(linelist_delay$days_onset_to_hosp, na.rm = TRUE)

9.8 Date display

Once dates are the correct class, you often want them to display differently, for example to display as “Friday 05 January” instead of “2018-01-05”. You may also want to adjust the display in order to then group rows by the date elements displayed - for example to group by month-year.

format()

Adjust date display with the base R function format(). This function accepts a character string (in quotes) specifying the desired output format in the “%” strptime abbreviations (the same syntax as used in as.Date()). Below are most of the common abbreviations. You can see the complete list by running ?strptime.

Note: using format() will convert the values to class Character, so this is generally used towards the end of an analysis or for display purposes only!

%d = Day number of month (5, 17, 28, etc.)
%j = Day number of the year (Julian day 001-366)
%a = Abbreviated weekday (Mon, Tue, Wed, etc.)
%A = Full weekday (Monday, Tuesday, etc.)
%w = Weekday number (0-6, Sunday is 0)
%u = Weekday number (1-7, Monday is 1)
%W = Week number (00-53, Monday is week start)
%U = Week number (00-53, Sunday is week start)
%m = Month number (e.g. 01, 02, 03, 04)
%b = Abbreviated month (Jan, Feb, etc.)
%B = Full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = Hours (24-hr clock)
%M = Minutes
%S = Seconds
%z = offset from GMT
%Z = Time zone (character)

An example of formatting today’s date:

# today's date, with formatting
format(Sys.Date(), format = "%d %B %Y")
## [1] "31 August 2021"
# easy way to get full date and time (default formatting)
date()
## [1] "Tue Aug 31 19:02:40 2021"
# formatted combined date, time, and time zone using str_glue() function
str_glue("{format(Sys.Date(), format = '%A, %B %d %Y, %z  %Z, ')}{format(Sys.time(), format = '%H:%M:%S')}")
## Tuesday, August 31 2021, +0000  UTC, 19:02:40
# Using format to display weeks
format(Sys.Date(), "%Y Week %W")
## [1] "2021 Week 35"

Note that when using str_glue(), you should use only single quotes within the enclosing double quotes (as above).

Month-Year

To convert a Date column to Month-year format, we suggest you use the function as.yearmon() from the zoo package. This converts the date to class “yearmon” and retains the proper ordering. In contrast, using format(column, "%Y %B") will convert to class Character and will order the values alphabetically (incorrectly).

Below, a new column yearmonth is created from the column date_onset, using the as.yearmon() function. The default (correct) ordering of the resulting values is shown in the table.

# create new column 
test_zoo <- linelist %>% 
     mutate(yearmonth = zoo::as.yearmon(date_onset))

# print table
table(test_zoo$yearmonth)
## 
## Apr 2014 May 2014 Jun 2014 Jul 2014 Aug 2014 Sep 2014 Oct 2014 Nov 2014 Dec 2014 Jan 2015 Feb 2015 Mar 2015 Apr 2015 
##        7       64      100      226      528     1070     1112      763      562      431      306      277      186

In contrast, you can see how only using format() does achieve the desired display format, but not the correct ordering.

# create new column
test_format <- linelist %>% 
     mutate(yearmonth = format(date_onset, "%b %Y"))

# print table
table(test_format$yearmonth)
## 
## Apr 2014 Apr 2015 Aug 2014 Dec 2014 Feb 2015 Jan 2015 Jul 2014 Jun 2014 Mar 2015 May 2014 Nov 2014 Oct 2014 Sep 2014 
##        7      186      528      562      306      431      226      100      277       64      763     1112     1070
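If you need to stay with class Character, one workaround (an alternative that is not part of the original example) is a year-first, all-numeric format such as "%Y-%m", because its alphabetical order matches chronological order. A minimal base R sketch with made-up dates:

```r
onset <- as.Date(c("2014-04-10", "2015-04-02", "2014-08-21", "2014-12-05"))

# year-first numeric format: character class, but sorts chronologically
year_month <- format(onset, "%Y-%m")
sorted_year_month <- sort(year_month)
sorted_year_month   # "2014-04" "2014-08" "2014-12" "2015-04"
```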

Note: if you are working within a ggplot() and want to adjust how dates are displayed only, it may be sufficient to provide a strptime format to the date_labels = argument in scale_x_date() - you can use "%b %Y" or "%Y %b". See the ggplot tips page.

zoo also offers the function as.yearqtr(), and you can use scale_x_yearmon() when using ggplot().

9.9 Epidemiological weeks

lubridate

See the page on Grouping data for more extensive examples of grouping data by date. Below we briefly describe grouping data by weeks.

We generally recommend using the floor_date() function from lubridate, with the argument unit = "week". This rounds the date down to the “start” of the week, as defined by the argument week_start =. Unless you have changed the lubridate.week.start option, the default week start is 7 (Sunday), but you can specify any day of the week as the start (e.g. 1 for Monday). floor_date() is versatile and can be used to round down to other time units by setting unit = to “second”, “minute”, “hour”, “day”, “month”, or “year”.

The returned value is the start date of the week, in Date class. Date class is useful when plotting the data, as it will be easily recognized and ordered correctly by ggplot().

If you are only interested in adjusting dates to display by week in a plot, see the section in this page on Date display. For example when plotting an epicurve you can format the date display by providing the desired strptime “%” nomenclature. For example, use “%Y-%W” or “%Y-%U” to return the year and week number (given Monday or Sunday week start, respectively).
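A short sketch of floor_date() on a single made-up date, assuming lubridate is installed. Setting week_start explicitly, as here, avoids relying on whatever default is active in your session.

```r
library(lubridate)

example_wednesday <- ymd("2020-03-04")   # a Wednesday

# round down to the start of the week, for two different week definitions
week_from_monday <- floor_date(example_wednesday, unit = "week", week_start = 1)
week_from_sunday <- floor_date(example_wednesday, unit = "week", week_start = 7)

week_from_monday   # "2020-03-02"
week_from_sunday   # "2020-03-01"

# the same function rounds down to other units
month_start <- floor_date(example_wednesday, unit = "month")
month_start        # "2020-03-01"
```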

Weekly counts

See the page on Grouping data for a thorough explanation of grouping data with count(), group_by(), and summarise(). A brief example is below.

  1. Create a new ‘week’ column with mutate(), using floor_date() with unit = "week"
  2. Get counts of rows (cases) per week with count(); filter out any cases with missing date
  3. Finish with complete() from tidyr to ensure that all weeks appear in the data - even those with no rows/cases. By default the count values for any “new” rows are NA, but you can make them 0 with the fill = argument, which expects a named list (below, n is the name of the counts column).
# Make aggregated dataset of weekly case counts
weekly_counts <- linelist %>% 
  drop_na(date_onset) %>%             # remove cases missing onset date
  mutate(weekly_cases = floor_date(   # make new column, week of onset
    date_onset,
    unit = "week")) %>%            
  count(weekly_cases) %>%           # group data by week and count rows per group (creates column 'n')
  tidyr::complete(                  # ensure all weeks are present, even those with no cases reported
    weekly_cases = seq.Date(          # re-define the "weekly_cases" column as a complete sequence,
      from = min(weekly_cases),       # from the minimum date
      to = max(weekly_cases),         # to the maximum date
      by = "week"),                   # by weeks
    fill = list(n = 0))             # fill-in NAs in the n counts column with 0
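To see what complete() contributes, the same steps can be run on a tiny invented dataset (three onset dates with an empty week in between). This sketch assumes dplyr, tidyr, and lubridate are installed. The names toy_cases and weekly_toy are hypothetical.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# three cases: two in the week of 2 March, none the following week, one the week after
toy_cases <- tibble(date_onset = ymd(c("2020-03-02", "2020-03-03", "2020-03-17")))

weekly_toy <- toy_cases %>%
  mutate(week = floor_date(date_onset, unit = "week", week_start = 1)) %>%
  count(week) %>%                                   # counts per week (column 'n')
  complete(
    week = seq.Date(min(week), max(week), by = "week"),
    fill = list(n = 0))                             # the empty week gets n = 0

weekly_toy$week   # "2020-03-02" "2020-03-09" "2020-03-16"
weekly_toy$n      # 2 0 1
```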


Epiweek alternatives

Note that lubridate also has functions week(), epiweek(), and isoweek(), each of which has slightly different start dates and other nuances. Generally speaking though, floor_date() should be all that you need. Read the details for these functions by entering ?week into the console or reading the documentation here.
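The differences between these three functions are easiest to see on specific dates. The sketch below, assuming lubridate is installed, uses two made-up dates: a Sunday in March and a 1st of January that falls on a Friday.

```r
library(lubridate)

sunday_in_march <- ymd("2020-03-01")   # a Sunday
new_years_day   <- ymd("2021-01-01")   # a Friday

# week(): complete 7-day periods since 1 January, plus one
week(sunday_in_march)      # 9
week(new_years_day)        # 1

# isoweek(): ISO 8601 weeks, which start on Monday
isoweek(sunday_in_march)   # 9
isoweek(new_years_day)     # 53 (still the last ISO week of 2020)

# epiweek(): same rules as isoweek() but weeks start on Sunday
epiweek(sunday_in_march)   # 10
epiweek(new_years_day)     # 53
```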

You might consider using the package aweek to set epidemiological weeks. You can read more about it on the RECON website. It has the functions date2week() and week2date() in which you can set the week start day with week_start = "Monday". This package is easiest if you want “week”-style outputs (e.g. “2020-W12”). Another advantage of aweek is that when date2week() is applied to a date column, the returned column (week format) is automatically of class Factor and includes levels for all weeks in the time span (this avoids the extra step of complete() described above). However, aweek does not have the functionality to round dates to other time units such as months, years, etc.

Another alternative for time series which also works well to show a “week” format (“2020 W12”) is yearweek() from the package tsibble, as demonstrated in the page on Time series and outbreak detection.

9.10 Converting dates/time zones

When data are recorded in different time zones, it is often important to standardise them to a single time zone. This can present a further challenge, as the time zone component of data must be coded manually in most cases.

In R, each datetime object has a timezone component. By default, all datetime objects will carry the local time zone for the computer being used - this is generally specific to a location rather than a named timezone, as time zones will often change in locations due to daylight savings time. It is not possible to accurately compensate for time zones without a time component of a date, as the event a date column represents cannot be attributed to a specific time, and therefore time shifts measured in hours cannot be reasonably accounted for.

To deal with time zones, there are a number of helper functions in lubridate that can be used to change the time zone of a datetime object from the local time zone to a different time zone. Time zones are set by attributing a valid tz database time zone to the datetime object. A list of these can be found here - if the location you are using data from is not on this list, nearby large cities in the time zone are available and serve the same purpose.

https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

# assign the current time to a column
time_now <- Sys.time()
time_now
## [1] "2021-08-31 19:02:40 EDT"
# use with_tz() to assign a new timezone to the column, while CHANGING the clock time
time_london_real <- with_tz(time_now, "Europe/London")

# use force_tz() to assign a new timezone to the column, while KEEPING the clock time
time_london_local <- force_tz(time_now, "Europe/London")


# note that as long as the computer that was used to run this code is NOT set to London time,
# there will be a difference in the times 
# (the number of hours difference from the computers time zone to london)
time_london_real - time_london_local
## Time difference of 5 hours

This may seem largely abstract, and is often not needed if the user isn’t working across time zones.
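For a reproducible illustration that does not depend on your computer's clock, the same two functions can be applied to a fixed datetime. This sketch assumes lubridate is installed and uses a made-up time in UTC. London is one hour ahead of UTC on this date.

```r
library(lubridate)

time_utc <- ymd_hm("2021-08-31 19:00", tz = "UTC")

# with_tz(): same moment in time, displayed on the London clock
same_instant <- with_tz(time_utc, "Europe/London")
same_instant   # "2021-08-31 20:00:00 BST"

# force_tz(): same clock reading, relabelled as London time (a different moment)
same_clock <- force_tz(time_utc, "Europe/London")
same_clock     # "2021-08-31 19:00:00 BST"

same_instant == time_utc   # TRUE
hours_apart <- as.numeric(difftime(same_instant, same_clock, units = "hours"))
hours_apart    # 1
```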

9.11 Lagging and leading calculations

lead() and lag() are functions from the dplyr package which help find previous (lagged) or subsequent (leading) values in a vector - typically a numeric or date vector. This is useful when doing calculations of change/difference between time units.

Let’s say you want to calculate the difference in cases between a current week and the previous one. The data are initially provided in weekly counts as shown below.

When using lag() or lead() the order of rows in the dataframe is very important! - pay attention to whether your dates/numbers are ascending or descending

First, create a new column containing the value of the previous (lagged) week.

  • Control the number of units back/forward with n = (must be a non-negative integer)
  • Use default = to define the value placed in non-existing rows (e.g. the first row for which there is no lagged value). By default this is NA.
  • Use order_by = to supply a column (for example, the week or date column) by which to order the values if the rows are not already sorted by your reference column

counts <- counts %>% 
  mutate(cases_prev_wk = lag(cases_wk, n = 1))

Next, create a new column which is the difference between the two cases columns:

counts <- counts %>% 
  mutate(cases_prev_wk = lag(cases_wk, n = 1),
         case_diff = cases_wk - cases_prev_wk)
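Here is a self-contained sketch with invented weekly counts, assuming dplyr is installed. The rows start out deliberately unsorted to show why ordering matters. The data frame counts_toy and its values are hypothetical.

```r
library(dplyr)

# weekly counts, deliberately out of order
counts_toy <- tibble(
  week     = as.Date("2020-01-06") + 7 * c(2, 0, 1, 3),
  cases_wk = c(6, 5, 8, 10))

counts_toy <- counts_toy %>%
  arrange(week) %>%                                  # sort first: lag()/lead() follow row order
  mutate(
    cases_prev_wk = lag(cases_wk, n = 1),            # value from the previous row
    cases_next_wk = lead(cases_wk, n = 1),           # value from the following row
    case_diff     = cases_wk - cases_prev_wk)        # change from the previous week

counts_toy$cases_prev_wk   # NA  5  8  6
counts_toy$cases_next_wk   #  8  6 10 NA
counts_toy$case_diff       # NA  3 -2  4
```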

You can read more about lead() and lag() in the documentation here or by entering ?lag in your console.

9.12 Resources

lubridate tidyverse page
lubridate RStudio cheatsheet
R for Data Science page on dates and times
Online tutorial Date formats

10 Characters and strings

This page demonstrates use of the stringr package to evaluate and handle character values (“strings”).

  1. Combine, order, split, arrange - str_c(), str_glue(), str_order(), str_split()
  2. Clean and standardise
    • Adjust length - str_pad(), str_trunc(), str_wrap()
    • Change case - str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
  3. Evaluate and extract by position - str_length(), str_sub(), word()
  4. Patterns
    • Detect and locate - str_detect(), str_subset(), str_match(), str_extract()
    • Modify and replace - str_sub(), str_replace_all()
  5. Regular expressions (“regex”)

For ease of display most examples are shown acting on a short defined character vector, however they can easily be adapted to a column within a data frame.

This stringr vignette provided much of the inspiration for this page.

10.1 Preparation

Load packages

Install or load the stringr and other tidyverse packages.

# install/load packages
pacman::p_load(
  stringr,    # many functions for handling strings
  tidyverse,  # for optional data manipulation
  rio,        # for importing data with import()
  tools)      # alternative for converting to title case

Import data

In this page we will occasionally reference the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist <- import("linelist_cleaned.rds")


10.2 Unite, split, and arrange

This section covers:

  • Using str_c(), str_glue(), and unite() to combine strings
  • Using str_order() to arrange strings
  • Using str_split() and separate() to split strings

Combine strings

To combine or concatenate multiple strings into one string, we suggest using str_c() from stringr. If you have distinct character values to combine, simply provide them as separate arguments, separated by commas.

str_c("String1", "String2", "String3")
## [1] "String1String2String3"

The argument sep = inserts a character value between each of the arguments you provided (e.g. inserting a comma, space, or newline "\n").

str_c("String1", "String2", "String3", sep = ", ")
## [1] "String1, String2, String3"

The argument collapse = is relevant if you are inputting multiple vectors as arguments to str_c(). It is used to separate the elements of what would be an output vector, such that the output vector only has one long character element.

The example below shows the combination of two vectors into one (first names and last names). Another similar example might be jurisdictions and their case counts. In this example:

  • The sep = value appears between each first and last name
  • The collapse = value appears between each person

first_names <- c("abdul", "fahruk", "janice")
last_names  <- c("hussein", "akinleye", "okeke")

# sep displays between the respective input strings, while collapse displays between the elements produced
str_c(first_names, last_names, sep = " ", collapse = ";  ")
## [1] "abdul hussein;  fahruk akinleye;  janice okeke"

Note: Depending on your desired display context, when printing such a combined string with newlines, you may need to wrap the whole phrase in cat() for the newlines to print properly:

# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))
## abdul hussein;
## fahruk akinleye;
## janice okeke

Dynamic strings

Use str_glue() to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.

  • All content goes between double quotation marks str_glue("")
  • Any dynamic code or references to pre-defined values are placed within curly brackets {} within the double quotation marks. There can be many curly brackets in the same str_glue() command.
  • To display character quotes ’’, use single quotes within the surrounding double quotes (e.g. when providing date format - see example below)
  • Tip: You can use \n to force a new line
  • Tip: You can use format() to adjust date display, and Sys.Date() to display the current date

A simple example of a dynamic plot caption:

str_glue("Data include {nrow(linelist)} cases and are current to {format(Sys.Date(), '%d %b %Y')}.")
## Data include 5888 cases and are current to 31 Aug 2021.

An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve code readability if the text is long.

str_glue("Linelist as of {current_date}.\nLast case hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
         current_date = format(Sys.Date(), '%d %b %Y'),
         last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
         n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
         )
## Linelist as of 31 Aug 2021.
## Last case hospitalized on 30 Apr 2015.
## 256 cases are missing date of onset and not shown

Pulling from a data frame

Sometimes, it is useful to pull data from a data frame and have it pasted together in sequence. Below is an example data frame. We will use it to make a summary statement about the jurisdictions and the new and total case counts.

# make case data frame
case_table <- data.frame(
  zone        = c("Zone 1", "Zone 2", "Zone 3", "Zone 4", "Zone 5"),
  new_cases   = c(3, 0, 7, 0, 15),
  total_cases = c(40, 4, 25, 10, 103)
  )

Use str_glue_data(), which is specially made for taking data from data frame rows:

case_table %>% 
  str_glue_data("{zone}: {new_cases} ({total_cases} total cases)")
## Zone 1: 3 (40 total cases)
## Zone 2: 0 (4 total cases)
## Zone 3: 7 (25 total cases)
## Zone 4: 0 (10 total cases)
## Zone 5: 15 (103 total cases)

Combine strings across rows

If you are trying to “roll-up” values in a data frame column, e.g. combine values from multiple rows into just one row by pasting them together with a separator, see the section of the De-duplication page on “rolling-up” values.
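As a rough illustration of the idea (the De-duplication page remains the fuller reference), the sketch below assumes a hypothetical data frame named symptom_rows with one row per reported symptom. It uses group_by() and summarise() from dplyr together with str_c() and collapse = to paste each case's values into a single row. The data frame and column names are invented for this example.

```r
library(dplyr)
library(stringr)

# hypothetical data: one row per reported symptom (names are for illustration only)
symptom_rows <- data.frame(
  case_ID = c(1, 1, 2, 3, 3),
  symptom = c("fever", "chills", "aches", "vomiting", "diarrhoea"))

# roll up: one row per case, with that case's symptoms pasted together
rolled_up <- symptom_rows %>% 
  group_by(case_ID) %>% 
  summarise(symptoms = str_c(symptom, collapse = "; "))

rolled_up
```

Here collapse = does the work of joining the values within each group, so the separator you choose appears between symptoms of the same case.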

Data frame to one line

You can make the statement appear in one line using str_c() (specifying the data frame and column names), and providing sep = and collapse = arguments.

str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  ")
## [1] "Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

You could add the prefix text “New Cases:” to the beginning of the statement by wrapping with a separate str_c() (if “New Cases:” was within the original str_c() it would appear multiple times).

str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = ";  "))
## [1] "New Cases: Zone 1 = 3;  Zone 2 = 0;  Zone 3 = 7;  Zone 4 = 0;  Zone 5 = 15"

Unite columns

Within a data frame, bringing together character values from multiple columns can be achieved with unite() from tidyr. This is the opposite of separate().

Provide the name of the new united column. Then provide the names of the columns you wish to unite.

  • By default, the separator used in the united column is underscore _, but this can be changed with the sep = argument.
  • remove = removes the input columns from the data frame (TRUE by default)
  • na.rm = removes missing values while uniting (FALSE by default)

Below, we define a mini-data frame to demonstrate with:

df <- data.frame(
  case_ID = c(1:6),
  symptoms  = c("jaundice, fever, chills",     # patient 1
                "chills, aches, pains",        # patient 2 
                "fever",                       # patient 3
                "vomiting, diarrhoea",         # patient 4
                "bleeding from gums, fever",   # patient 5
                "rapid pulse, headache"),      # patient 6
  outcome = c("Recover", "Death", "Death", "Recover", "Recover", "Recover"))
df_split <- separate(df, symptoms, into = c("sym_1", "sym_2", "sym_3"), extra = "merge")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [3, 4].


Below, we unite the three symptom columns:

df_split %>% 
  unite(
    col = "all_symptoms",         # name of the new united column
    c("sym_1", "sym_2", "sym_3"), # columns to unite
    sep = ", ",                   # separator to use in united column
    remove = TRUE,                # if TRUE, removes input cols from the data frame
    na.rm = TRUE                  # if TRUE, missing values are removed before uniting
  )
##   case_ID                all_symptoms outcome
## 1       1     jaundice, fever, chills Recover
## 2       2        chills, aches, pains   Death
## 3       3                       fever   Death
## 4       4         vomiting, diarrhoea Recover
## 5       5 bleeding, from, gums, fever Recover
## 6       6      rapid, pulse, headache Recover

Split

To split a string based on a pattern, use str_split(). It evaluates the string(s) and returns a list of character vectors consisting of the newly-split values.

The simple example below evaluates one string and splits it into three. By default it returns an object of class list with one element (a character vector) for each string initially provided. If simplify = TRUE it returns a character matrix.

In this example, one string is provided, and the function returns a list with one element - a character vector with three values.

str_split(string = "jaundice, fever, chills",
          pattern = ",")
## [[1]]
## [1] "jaundice" " fever"   " chills"

If the output is saved, you can then access the nth split value with bracket syntax. For example, the_returned_object[[1]][2] accesses the second value from the first evaluated string (“ fever”, which keeps the leading space that followed the comma). See the R basics page for more detail on accessing elements.

pt1_symptoms <- str_split("jaundice, fever, chills", ",")

pt1_symptoms[[1]][2]  # extracts 2nd value from 1st (and only) element of the list
## [1] " fever"
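Splitting on a comma alone leaves the space that followed it at the start of the value. Two possible ways to avoid this are sketched below: include the space in the split pattern, or pass the split values through str_trim() (covered later on this page). The object names split_clean and split_trimmed are for illustration only.

```r
library(stringr)

# option 1: split on comma-plus-space
split_clean <- str_split("jaundice, fever, chills", ", ")
split_clean[[1]][2]

# option 2: split on the comma, then trim whitespace from each value
split_trimmed <- str_trim(str_split("jaundice, fever, chills", ",")[[1]])
split_trimmed
```

The second option is more forgiving if the spacing after commas is inconsistent in your data.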

If multiple strings are provided to str_split(), there will be more than one element in the returned list.

symptoms <- c("jaundice, fever, chills",     # patient 1
              "chills, aches, pains",        # patient 2 
              "fever",                       # patient 3
              "vomiting, diarrhoea",         # patient 4
              "bleeding from gums, fever",   # patient 5
              "rapid pulse, headache")       # patient 6

str_split(symptoms, ",")                     # split each patient's symptoms
## [[1]]
## [1] "jaundice" " fever"   " chills" 
## 
## [[2]]
## [1] "chills" " aches" " pains"
## 
## [[3]]
## [1] "fever"
## 
## [[4]]
## [1] "vomiting"   " diarrhoea"
## 
## [[5]]
## [1] "bleeding from gums" " fever"            
## 
## [[6]]
## [1] "rapid pulse" " headache"

To return a “character matrix” instead, which may be useful if creating data frame columns, set the argument simplify = TRUE as shown below:

str_split(symptoms, ",", simplify = TRUE)
##      [,1]                 [,2]         [,3]     
## [1,] "jaundice"           " fever"     " chills"
## [2,] "chills"             " aches"     " pains" 
## [3,] "fever"              ""           ""       
## [4,] "vomiting"           " diarrhoea" ""       
## [5,] "bleeding from gums" " fever"     ""       
## [6,] "rapid pulse"        " headache"  ""

You can also adjust the number of pieces to create with the n = argument. For example, n = 2 restricts the output to 2 pieces per string. Any further commas remain within the second value.

str_split(symptoms, ",", simplify = TRUE, n = 2)
##      [,1]                 [,2]            
## [1,] "jaundice"           " fever, chills"
## [2,] "chills"             " aches, pains" 
## [3,] "fever"              ""              
## [4,] "vomiting"           " diarrhoea"    
## [5,] "bleeding from gums" " fever"        
## [6,] "rapid pulse"        " headache"

Note - the same outputs can be achieved with str_split_fixed(), in which you do not give the simplify argument, but must instead designate the number of columns (n).

str_split_fixed(symptoms, ",", n = 2)

Split columns

If you are trying to split a data frame column, it is best to use the separate() function from tidyr. It is used to split one character column into other columns.

Let’s say we have a simple data frame df (defined and united in the unite section) containing a case_ID column, one character column with many symptoms, and one outcome column. Our goal is to separate the symptoms column into many columns - each one containing one symptom.

Assuming the data are piped into separate(), first provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.

  • sep = the separator, can be a character, or a number (interpreted as the character position to split at)
  • remove = TRUE by default, removes the input column
  • convert = FALSE by default; if set to TRUE, string “NA”s become NA (and the new columns are converted to other classes where appropriate)
  • extra = this controls what happens if there are more values created by the separation than new columns named.
    • extra = "warn" means you will see a warning but it will drop excess values (the default)
    • extra = "drop" means the excess values will be dropped with no warning
    • extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data

An example with extra = "merge" is below - no data is lost. Two new columns are defined but any third symptoms are left in the second new column:

# third symptoms combined into second new column
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",", extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
##   case_ID              sym_1          sym_2 outcome
## 1       1           jaundice  fever, chills Recover
## 2       2             chills   aches, pains   Death
## 3       3              fever           <NA>   Death
## 4       4           vomiting      diarrhoea Recover
## 5       5 bleeding from gums          fever Recover
## 6       6        rapid pulse       headache Recover

When the default extra = "warn" is used below, a warning is given but the third symptoms are lost:

# third symptoms are lost
df %>% 
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",")
## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [1, 2].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
##   case_ID              sym_1      sym_2 outcome
## 1       1           jaundice      fever Recover
## 2       2             chills      aches   Death
## 3       3              fever       <NA>   Death
## 4       4           vomiting  diarrhoea Recover
## 5       5 bleeding from gums      fever Recover
## 6       6        rapid pulse   headache Recover

CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.

Arrange alphabetically

Several strings can be sorted by alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.

# strings
health_zones <- c("Alba", "Takota", "Delta")

# return the alphabetical order
str_order(health_zones)
## [1] 1 3 2
# return the strings in alphabetical order
str_sort(health_zones)
## [1] "Alba"   "Delta"  "Takota"

To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.
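Beyond locale =, str_sort() and str_order() accept a few other arguments that can be handy for labels such as zone names. The sketch below assumes a reasonably recent version of stringr, in which decreasing = and numeric = are available; the zones vector is a made-up example.

```r
library(stringr)

# hypothetical zone labels that contain numbers
zones <- c("Zone 10", "Zone 2", "Zone 1")

sorted_alpha <- str_sort(zones)                     # character by character: "Zone 10" comes before "Zone 2"
sorted_num   <- str_sort(zones, numeric = TRUE)     # digits are compared as numbers
sorted_desc  <- str_sort(zones, decreasing = TRUE)  # reverse order

sorted_alpha
sorted_num
sorted_desc
```

Sorting with numeric = TRUE is often what you want for display in tables and plot axes, where "Zone 10" should follow "Zone 2".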

base R functions

It is common to see base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax is arguably more complicated - in the parentheses each part is separated by a comma. The parts are either character text (in quotes) or pre-defined code objects (no quotes). For example:

n_beds <- 10
n_masks <- 20

paste0("Regional hospital needs ", n_beds, " beds and ", n_masks, " masks.")
## [1] "Regional hospital needs 10 beds and 20 masks."

sep = and collapse = arguments can be specified. paste() is simply paste0() with a default sep = " " (one space).
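To make the role of these arguments concrete, here is a small base R sketch. The zones and new_cases vectors are invented for illustration.

```r
# hypothetical vectors for illustration
zones     <- c("Zone 1", "Zone 2", "Zone 3")
new_cases <- c(3, 0, 7)

# paste() inserts one space between the parts by default
with_space <- paste(zones, new_cases)

# sep = changes the separator; collapse = joins the results into one string
one_line <- paste(zones, new_cases, sep = " = ", collapse = "; ")

# paste0() inserts nothing between the parts
no_space <- paste0(zones, ":", new_cases)

with_space
one_line
no_space
```

As with str_c(), sep = acts between the parts of each element, while collapse = acts between the resulting elements.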

10.3 Clean and standardise

Change case

Often one must alter the case/capitalization of a string value, for example names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title() from stringr, as shown below:

str_to_upper("California")
## [1] "CALIFORNIA"
str_to_lower("California")
## [1] "california"

Using base R, the above can also be achieved with toupper() and tolower().

Title case

Transforming the string so each word is capitalized can be achieved with str_to_title():

str_to_title("go to the US state of california ")
## [1] "Go To The Us State Of California "

Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).

tools::toTitleCase("This is the US state of california")
## [1] "This is the US State of California"

You can also use str_to_sentence(), which capitalizes only the first letter of the string.

str_to_sentence("the patient must be transported")
## [1] "The patient must be transported"

Pad length

Use str_pad() to add characters to a string, to a minimum length. By default spaces are added, but you can also pad with other characters using the pad = argument.

# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")

# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")
## [1] "R10.13 " "R10.819" "R17    "
# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")
## [1] "R10.13." "R10.819" "R17...."

For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to minimum length of 2 with pad = "0".

# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0") 
## [1] "04"
# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")
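The commented line above can be expanded into a small data frame example. This is a sketch only: the data frame visit_times and its columns are hypothetical, and the numeric values are converted with as.character() before padding so that the input to str_pad() is explicitly character.

```r
library(dplyr)
library(stringr)

# hypothetical data frame with numeric hour and minute columns
visit_times <- data.frame(hours = c(4, 11, 9), minutes = c(5, 30, 0))

visit_times <- visit_times %>% 
  mutate(
    # convert to character first, then pad to two digits with leading zeros
    hours_chr   = str_pad(as.character(hours),   2, pad = "0"),
    minutes_chr = str_pad(as.character(minutes), 2, pad = "0"),
    # combine into a display label such as "04:05"
    time_label  = str_c(hours_chr, ":", minutes_chr))

visit_times$time_label
```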

Truncate

str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).

original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")
## [1] "Symp...ing"
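For comparison, the sketch below applies the other side = options and an empty ellipsis = to the same string. The object names are for illustration only.

```r
library(stringr)

original <- "Symptom onset on 4/3/2020 with vomiting"

trunc_right <- str_trunc(original, 10, side = "right")   # ellipsis at the end (the default side)
trunc_left  <- str_trunc(original, 10, side = "left")    # ellipsis at the start
trunc_plain <- str_trunc(original, 10, ellipsis = "")    # no ellipsis, so all 10 characters are text

trunc_right
trunc_left
trunc_plain
```

In each case the result is 10 characters long; with the default three-character ellipsis, only 7 characters of the original text are kept.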

Standardize length

Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then one very short value is padded to achieve length of 6.

# ICD codes of differing length
ICD_codes   <- c("R10.13",
                 "R10.819",
                 "R17")

# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2
## [1] "R10.13" "R10..." "R17"
# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3
## [1] "R10.13" "R10..." "R17   "

Remove leading/trailing whitespace

Use str_trim() to remove spaces, newlines (\n) or tabs (\t) on the sides of a string input. Add "right", "left", or "both" to the command to specify which side to trim (e.g. str_trim(x, "right")).

# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space

# IDs trimmed to remove excess spaces on right side only
str_trim(IDs, side = "right")
## [1] "provA_1852" "provA_2345" "provA_9460"

Remove repeated whitespace within

Use str_squish() to remove repeated spaces that appear inside a string, for example to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string, like str_trim().

# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n") 
## [1] "Pt requires IV saline"

Enter ?str_trim, ?str_pad in your R console to see further details.

Wrap into paragraphs

Use str_wrap() to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.

pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."

str_wrap(pt_course, 40)
## [1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."

The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.

cat(str_wrap(pt_course, 40))
## Symptom onset 1/4/2020 vomiting chills
## fever. Pt saw traditional healer in
## home village on 2/4/2020. On 5/4/2020
## pt symptoms worsened and was admitted
## to Lumta clinic. Sample was taken and pt
## was transported to regional hospital on
## 6/4/2020. Pt died at regional hospital
## on 7/4/2020.

10.4 Handle by position

Extract by character position

Use str_sub() to return only a part of a string. The function takes three main arguments:

  1. the character vector(s)
  2. start position
  3. end position

A few notes on position numbers:

  • If a position number is positive, the position is counted starting from the left end of the string.
  • If a position number is negative, it is counted starting from the right end of the string.
  • Position numbers are inclusive.
  • Positions extending beyond the string will be truncated (removed).

Below are some examples applied to the string “pneumonia”:

# start and end third from left (3rd letter from left)
str_sub("pneumonia", 3, 3)
## [1] "e"
# 0 is not present
str_sub("pneumonia", 0, 0)
## [1] ""
# 6th from left, to the 1st from right
str_sub("pneumonia", 6, -1)
## [1] "onia"
# 5th from right, to the 2nd from right
str_sub("pneumonia", -5, -2)
## [1] "moni"
# 4th from left to a position outside the string
str_sub("pneumonia", 4, 15)
## [1] "umonia"

Extract by word position

To extract the nth ‘word’, use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.

By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).

# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")

# extract 1st to 3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")
## [1] "I just got"       "My stomach hurts" "Severe ear pain"
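Two further uses of word() are sketched below: a negative position counts from the end of the string, and sep = can be changed for values that are not space-separated. The underscore-separated IDs are hypothetical, and fixed() is used so the separator is treated as literal text.

```r
library(stringr)

chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")

# negative positions count from the end: -1 is the last word
last_words <- word(chief_complaints, start = -1)

# hypothetical IDs whose parts are separated by underscores
first_parts <- word(c("provA_1852", "provB_2345"), start = 1, sep = fixed("_"))

last_words
first_parts
```

Note that punctuation attached to a word (such as the final period in the first complaint) is returned as part of that word.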

Replace by character position

str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:

word <- "pneumonia"

# convert the third and fourth characters to X 
str_sub(word, 3, 4) <- "XX"

# print
word
## [1] "pnXXmonia"

An example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.

words <- c("pneumonia", "tuberculosis", "HIV")

# convert the third and fourth characters to X 
str_sub(words, 3, 4) <- "XX"

words
## [1] "pnXXmonia"    "tuXXrculosis" "HIXX"

Evaluate length

str_length("abc")
## [1] 3

Alternatively, use nchar() from base R

10.5 Patterns

Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.

Detect a pattern

Use str_detect() as below to detect presence/absence of a pattern within a string. First provide the string or vector to search in (string =), and then the pattern to look for (pattern =). Note that by default the search is case sensitive!

str_detect(string = "primary school teacher", pattern = "teach")
## [1] TRUE

The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.

str_detect(string = "primary school teacher", pattern = "teach", negate = TRUE)
## [1] FALSE

To ignore case/capitalization, wrap the pattern within regex(), and within regex() add the argument ignore_case = TRUE (or T as shorthand).

str_detect(string = "Teacher", pattern = regex("teach", ignore_case = T))
## [1] TRUE

When str_detect() is applied to a character vector or a data frame column, it will return TRUE or FALSE for each of the values.

# a vector/column of occupations 
occupations <- c("field laborer",
                 "university professor",
                 "primary school teacher & tutor",
                 "tutor",
                 "nurse at regional hospital",
                 "lineworker at Amberdeen Fish Factory",
                 "physican",
                 "cardiologist",
                 "office worker",
                 "food service")

# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

If you need to count the TRUEs, simply sum() the output. This returns the number of values that are TRUE.

sum(str_detect(occupations, "teach"))
## [1] 1

To search inclusive of multiple terms, include them separated by OR bars (|) within the pattern = argument, as shown below:

sum(str_detect(string = occupations, pattern = "teach|professor|tutor"))
## [1] 3

If you need to build a long list of search terms, you can combine them using str_c() and sep = "|", save the result as a character object, and then reference that object later more succinctly. The example below includes possible occupation search terms for front-line medical providers.

# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
                                  "surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
                                  "intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
                                  "cna", "pa", "physician assistant", "mental health",
                                  "emergency department technician", "resp therapist", "respiratory",
                                  "phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
                                  "rehab", "activity", "elderly", "subacute", "sub acute",
                                  "clinic", "post acute", "therapist", "extended care",
                                  "dental", "dential", "dentist", sep = "|")

occupation_med_frontline
## [1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"

This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline):

sum(str_detect(string = occupations, pattern = occupation_med_frontline))
## [1] 2
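The same logic carries over to a data frame column. As a sketch, assuming a hypothetical data frame named staff with an occupation column, str_detect() can be placed inside filter() from dplyr to keep only the matching rows.

```r
library(dplyr)
library(stringr)

# hypothetical data frame with an occupation column
staff <- data.frame(
  id = 1:4,
  occupation = c("nurse at regional hospital", "field laborer", "Cardiologist", "tutor"))

# keep only rows whose occupation matches either term, ignoring case
med_staff <- staff %>% 
  filter(str_detect(occupation, regex("nurse|cardiologist", ignore_case = TRUE)))

med_staff
```

In practice, the pattern could be a longer saved object of search terms, such as the one built with str_c() and sep = "|".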

Base R string search functions

The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to involve the regex() function).

Likewise, the base functions sub() and gsub() act similarly to str_replace() and str_replace_all(), respectively. Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
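A brief base R sketch of these three functions follows. The vectors are invented for illustration.

```r
# hypothetical values for illustration
occupations_demo <- c("Primary school Teacher", "tutor", "nurse")

# grepl() returns TRUE/FALSE for each value; ignore.case = TRUE makes it case-insensitive
is_teacher <- grepl("teach", occupations_demo, ignore.case = TRUE)

status <- "dead, dead, not dead"
first_only <- sub("dead", "deceased", status)    # replaces the first instance only
all_found  <- gsub("dead", "deceased", status)   # replaces every instance

is_teacher
first_only
all_found
```

Note the argument order: in these base functions the pattern comes first and the strings to search come later, the reverse of the stringr functions.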

Convert commas to periods

Here is an example of using gsub() to convert commas to periods in a vector of numbers. This could be useful if your data come from parts of the world other than the United States or Great Britain.

The inner gsub(), which acts first on lengths, converts any periods to an empty string "". The period character "." has to be “escaped” with two backslashes to signify a literal period, because “.” in regex means “any character”. Then, the result (with only commas) is passed to the outer gsub(), in which commas are replaced by periods.

lengths <- c("2.454,56", "1,2", "6.096,5")

as.numeric(gsub(pattern = ",",                # find commas     
                replacement = ".",            # replace with periods
                x = gsub("\\.", "", lengths)  # vector with other periods removed (periods escaped)
                )
           )                                  # convert outcome to numeric
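For readers who prefer to stay within stringr, the same conversion could be sketched as below. This is one possible equivalent rather than the only approach: str_remove_all() and str_replace() are stringr functions, and fixed() tells them to treat the period and comma as literal characters so no escaping is needed. The object names are for illustration only.

```r
library(stringr)

lengths_chr <- c("2.454,56", "1,2", "6.096,5")

# step 1: drop the thousands separators (literal periods)
no_periods <- str_remove_all(lengths_chr, fixed("."))

# step 2: convert the decimal comma to a period
with_decimal <- str_replace(no_periods, fixed(","), ".")

# step 3: convert to numeric
lengths_num <- as.numeric(with_decimal)

lengths_num
```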

Replace all

Use str_replace_all() as a “find and replace” tool. First, provide the strings to be evaluated to string =, then the pattern to be replaced to pattern =, and then the replacement value to replacement =. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.

outcome <- c("Karl: dead",
            "Samantha: dead",
            "Marco: not dead")

str_replace_all(string = outcome, pattern = "dead", replacement = "deceased")
## [1] "Karl: deceased"      "Samantha: deceased"  "Marco: not deceased"

Notes:

  • To convert missing values (NA) into a character value such as “NA”, use str_replace_na().
  • The function str_replace() replaces only the first instance of the pattern within each evaluated string.
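The differences between these functions can be seen in a short sketch. The status vector is invented for illustration; note that str_replace_na() acts on missing values, turning them into text.

```r
library(stringr)

# hypothetical values, including one missing value
status <- c("dead, dead", "not dead", NA)

first_only <- str_replace(status, "dead", "deceased")           # first instance in each string only
all_found  <- str_replace_all(status, "dead", "deceased")       # every instance
na_filled  <- str_replace_na(status, replacement = "Unknown")   # NA becomes the text "Unknown"

first_only
all_found
na_filled
```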

Detect within logic

Within case_when()

str_detect() is often used within case_when() (from dplyr). Let’s say occupations is a column in a data frame named df. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().

df <- df %>% 
  mutate(is_educator = case_when(
    # term search within occupation, not case sensitive
    str_detect(occupations,
               regex("teach|prof|tutor|university",
                     ignore_case = TRUE))              ~ "Educator",
    # all others
    TRUE                                               ~ "Not an educator"))

As a reminder, it may be important to add exclusion criteria to the conditional logic (negate = TRUE):

df <- df %>% 
  # value in new column is_educator is based on conditional logic
  mutate(is_educator = case_when(
    
    # occupation column must meet 2 criteria to be assigned "Educator":
    # it must have a search term AND NOT any exclusion term
    
    # Must have a search term
    str_detect(occupations,
               regex("teach|prof|tutor|university", ignore_case = T)) &              
    
    # AND must NOT have an exclusion term
    str_detect(occupations,
               regex("admin", ignore_case = T),
               negate = TRUE)                       ~ "Educator",
    
    # All rows not meeting above criteria
    TRUE                                            ~ "Not an educator"))
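To try this logic end to end, the self-contained sketch below uses a small hypothetical data frame named jobs with an occupation column. The names and values are assumptions for illustration, not part of the linelist.

```r
library(dplyr)
library(stringr)

# hypothetical data frame with an occupation column
jobs <- data.frame(
  occupation = c("primary school teacher", "University administrator", "nurse", "Tutor"))

jobs <- jobs %>% 
  mutate(is_educator = case_when(
    # must have a search term AND must NOT have an exclusion term
    str_detect(occupation, regex("teach|prof|tutor|university", ignore_case = TRUE)) &
      str_detect(occupation, regex("admin", ignore_case = TRUE), negate = TRUE) ~ "Educator",
    # all rows not meeting the above criteria
    TRUE                                                                        ~ "Not an educator"))

jobs
```

The second row matches “university” but is excluded by “admin”, so it is classified as “Not an educator”.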

Locate pattern position

To locate the first position of a pattern, use str_locate(). It outputs a start and end position.

str_locate("I wish", "sh")
##      start end
## [1,]     5   6

Like other str functions, there is an "_all" version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.

phrases <- c("I wish", "I hope", "he hopes", "He hopes")

str_locate(phrases, "h" )     # position of *first* instance of the pattern
##      start end
## [1,]     6   6
## [2,]     3   3
## [3,]     1   1
## [4,]     4   4
str_locate_all(phrases, "h" ) # position of *every* instance of the pattern
## [[1]]
##      start end
## [1,]     6   6
## 
## [[2]]
##      start end
## [1,]     3   3
## 
## [[3]]
##      start end
## [1,]     1   1
## [2,]     4   4
## 
## [[4]]
##      start end
## [1,]     4   4

Extract a match

str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, you might look in the character vector of occupations (defined above) for either “teach”, “prof”, or “tutor”.

str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.

str_extract_all(occupations, "teach|prof|tutor")
## [[1]]
## character(0)
## 
## [[2]]
## [1] "prof"
## 
## [[3]]
## [1] "teach" "tutor"
## 
## [[4]]
## [1] "tutor"
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## character(0)
## 
## [[8]]
## character(0)
## 
## [[9]]
## character(0)
## 
## [[10]]
## character(0)

str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3’s matches is not shown.

str_extract(occupations, "teach|prof|tutor")
##  [1] NA      "prof"  "teach" "tutor" NA      NA      NA      NA      NA      NA

Subset and count

Related functions include str_subset() and str_count().

str_subset() returns the actual values which contained the pattern:

str_subset(occupations, "teach|prof|tutor")
## [1] "university professor"           "primary school teacher & tutor" "tutor"

str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.

str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))
##  [1] 0 1 2 1 0 0 0 0 0 0

Regex groups

UNDER CONSTRUCTION

10.6 Special characters

Backslash \ as escape

The backslash \ is used to “escape” the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not “break” the surrounding quote marks.

Note - thus, if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one.

Special characters

Special character Represents
"\\" backslash
"\n" a new line (newline)
"\"" double-quote within double quotes
'\'' single-quote within single quotes
"\`" grave accent
"\r" carriage return
"\t" tab
"\v" vertical tab
"\b" backspace

Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).
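
As a minimal sketch of the escaping rules above (base R only; the object names are illustrative and not part of the handbook data): print() shows a string in its escaped form, whereas writeLines() shows the characters that are actually stored.

```r
# two backslashes are typed, but only one character is stored
one_backslash <- "\\"
nchar(one_backslash)      # number of characters stored
writeLines(one_backslash) # prints a single backslash

# a double quote inside double quotes must be escaped
quoted <- "She said \"hello\""
print(quoted)             # escaped form, with the backslashes visible
writeLines(quoted)        # stored form, without the backslashes

# \n is one newline character, so writeLines() prints two lines
two_lines <- "line 1\nline 2"
writeLines(two_lines)
```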

10.7 Regular expressions (regex)

10.8 Regex and special characters

Regular expressions, or “regex”, are a concise language for describing patterns in strings. If you are not familiar with it, a regular expression can look like an alien language. Here we try to de-mystify this language a little bit.

Much of this section is adapted from this tutorial and this cheatsheet. We selectively adapt here knowing that this handbook might be viewed by people without internet access to view the other tutorials.

A regular expression is often applied to extract specific patterns from “unstructured” text - for example medical notes, chief complaints, patient history, or other free text columns in a data frame.

There are four basic tools one can use to create a basic regular expression:

  1. Character sets
  2. Meta characters
  3. Quantifiers
  4. Groups

Character sets

Character sets are a way of listing the options for a single character match, within square brackets. A match is triggered if any one of the characters within the brackets is found in the string. For example, to look for vowels one could use this character set: “[aeiou]”. Some other common character sets are:

Character set Matches for
"[A-Z]" any single capital letter
"[a-z]" any single lowercase letter
"[0-9]" any digit
[:alnum:] any alphanumeric character
[:digit:] any numeric digit
[:alpha:] any letter (upper or lowercase)
[:upper:] any uppercase letter
[:lower:] any lowercase letter

Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
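
The character sets above can be tried out on a short string. This is only a sketch: the string x is made up for illustration, and str_extract_all() from stringr is used as elsewhere on this page.

```r
library(stringr)

# illustrative string (not from the handbook data)
x <- "Ward B7, bed 23"

str_extract_all(x, "[aeiou]")    # every lowercase vowel
str_extract_all(x, "[A-Z]")      # every capital letter
str_extract_all(x, "[A-Za-z]")   # every letter, upper or lowercase
str_extract_all(x, "[a-e0-5]")   # lowercase a through e OR digit 0 through 5
```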

Meta characters

Meta characters are shorthand for character sets. Some of the important ones are listed below:

Meta character Represents
"\\s" any single whitespace character (space, tab, or newline)
"\\w" any single “word” character (A-Z, a-z, 0-9, or underscore)
"\\d" any single numeric digit (0-9)

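A brief sketch of the meta characters in the table above, using a made-up string (x is illustrative only). Note the double backslash, which is needed because the backslash itself must be escaped inside an R string.

```r
library(stringr)

# illustrative string (not from the handbook data)
x <- "Temp 38 C, pulse 100"

str_extract_all(x, "\\d")    # each single digit
str_count(x, "\\s")          # number of whitespace characters
str_extract_all(x, "\\w+")   # runs of word characters (see Quantifiers for "+")
```
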
Quantifiers

Typically you do not want to search for a match on only one character. Quantifiers allow you to designate the length of letters/numbers to allow for the match.

Quantifiers are numbers written within curly brackets { } after the character they are quantifying, for example,

  • "A{2}" will return instances of two capital A letters.
  • "A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
  • "A{2,}" will return instances of two or more capital A letters.
  • "A+" will return instances of one or more capital A letters (group extended until a different character is encountered).
  • "A*" will return instances of zero or more capital A letters. Like the other quantifiers, the * asterisk is written after the character it quantifies (useful if you are not sure the pattern is present)

Using the + plus symbol as a quantifier, the match will continue until a different character is encountered. For example, this expression will return all words (runs of alpha characters): "[A-Za-z]+"

# test string for quantifiers
test <- "A-AA-AAA-AAAA"

When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA.

str_extract_all(test, "A{2}")
## [[1]]
## [1] "AA" "AA" "AA" "AA"

When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.

str_extract_all(test, "A{2,4}")
## [[1]]
## [1] "AA"   "AAA"  "AAAA"

With the quantifier +, groups of one or more are returned:

str_extract_all(test, "A+")
## [[1]]
## [1] "A"    "AA"   "AAA"  "AAAA"

Relative position

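These express requirements for what precedes or follows a pattern, without including the surrounding text in the match. For example, to find the boundaries between sentences you could search for “a space that is preceded by a period and followed by a capital letter”: "(?<=\\.)\\s(?=[A-Z])".

Note that an empty pattern ("") matches at every character, so the command below simply splits the test string into its individual characters:

str_extract_all(test, "")
## [[1]]
##  [1] "A" "-" "A" "A" "-" "A" "A" "A" "-" "A" "A" "A" "A"

Position statement Matches to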
"(?<=b)a" “a” that is preceded by a “b”
"(?<!b)a" “a” that is NOT preceded by a “b”
"a(?=b)" “a” that is followed by a “b”
"a(?!b)" “a” that is NOT followed by a “b”

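The four position statements in the table can be checked on small strings. This is a sketch with made-up strings (x and y are illustrative only); str_count() and str_extract() are from stringr.

```r
library(stringr)

# illustrative string with "a" in different contexts
x <- "ba ca ab"

str_count(x, "(?<=b)a")   # "a" preceded by "b": the "a" in "ba"
str_count(x, "(?<!b)a")   # "a" NOT preceded by "b": the "a" in "ca" and in "ab"
str_count(x, "a(?=b)")    # "a" followed by "b": the "a" in "ab"
str_count(x, "a(?!b)")    # "a" NOT followed by "b": the "a" in "ba" and in "ca"

# the text before or after is required, but is not part of the returned match
y <- "weight 70kg, height 172cm"
str_extract(y, "\\d+(?=kg)")         # digits followed by "kg", without the "kg"
str_extract(y, "(?<=height )\\d+")   # digits preceded by "height "
```
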
Groups

Capturing groups in your regular expression is a way to have a more organized output upon extraction.
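
One way to see what groups provide is str_match() from stringr, which returns the full match plus one column per group written in parentheses. This is a sketch with a made-up string (note and m are illustrative names).

```r
library(stringr)

# illustrative string (not from the handbook data)
note <- "Arrived at 18:00 on 6/12/2005"

# two groups: the hour and the minutes, separated by a colon
m <- str_match(note, "(\\d{1,2}):(\\d{2})")

m[, 1]   # the full match
m[, 2]   # first group (hour)
m[, 3]   # second group (minutes)
```

The result is a character matrix with one row per evaluated string, so the groups can be assigned directly to new columns.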

Regex examples

Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.

pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."

This expression matches all words (any run of letters, until hitting a non-letter such as a space, digit, or punctuation mark):

str_extract_all(pt_note, "[A-Za-z]+")
## [[1]]
##  [1] "Patient"     "arrived"     "at"          "Broward"     "Hospital"    "emergency"   "ward"        "at"          "on"          "Patient"     "presented"  
## [12] "with"        "radiating"   "abdominal"   "pain"        "from"        "LR"          "quadrant"    "Patient"     "skin"        "was"         "pale"       
## [23] "cool"        "and"         "clammy"      "Patient"     "temperature" "was"         "degrees"     "farinheit"   "Patient"     "pulse"       "rate"       
## [34] "was"         "bpm"         "and"         "thready"     "Respiratory" "rate"        "was"         "per"         "minute"

The expression "[0-9]{1,2}" matches to consecutive numbers that are 1 or 2 digits in length. It could also be written "\\d{1,2}", or "[:digit:]{1,2}".

str_extract_all(pt_note, "[0-9]{1,2}")
## [[1]]
##  [1] "18" "00" "6"  "12" "20" "05" "99" "8"  "10" "0"  "29"
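
More specific patterns can pull out whole values rather than fragments. The sketch below uses an abridged copy of the note above (the object name note is illustrative) and combines character classes, quantifiers, and escaped literal characters.

```r
library(stringr)

# abridged copy of the patient note, for illustration
note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient temperature was 99.8 degrees."

str_extract(note, "\\d{1,2}:\\d{2}")            # a time: 1-2 digits, colon, 2 digits
str_extract(note, "\\d{1,2}/\\d{1,2}/\\d{4}")   # a date with a 4-digit year
str_extract(note, "\\d+\\.\\d+")                # a decimal number (escaped period)
```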

You can view a useful list of regular expressions and tips on page 2 of this cheatsheet.

Also see this tutorial.

10.9 Resources

A reference sheet for stringr functions can be found here

A vignette on stringr can be found here

11 Factors

In R, factors are a class of data that allow for ordered categories with a fixed set of acceptable values.

Typically, you would convert a column from character or numeric class to a factor if you want to set an intrinsic order to the values (“levels”) so they can be displayed non-alphabetically in plots and tables. Another common use of factors is to standardise the legends of plots so they do not fluctuate if certain values are temporarily absent from the data.

This page demonstrates use of functions from the package forcats (a short name for “For categorical variables”) and some base R functions. We also touch upon the use of lubridate and aweek for special factor cases related to epidemiological weeks.

A complete list of forcats functions can be found online here. Below we demonstrate some of the most common ones.

11.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,           # import/export
  here,          # filepaths
  lubridate,     # working with dates
  forcats,       # factors
  aweek,         # create epiweeks with automatic factor levels
  janitor,       # tables
  tidyverse      # data mgmt and viz
  )

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import your dataset
linelist <- import("linelist_cleaned.rds")

New categorical variable

For demonstration in this page we will use a common scenario - the creation of a new categorical variable.

Note that if you convert a numeric column to class factor, you will not be able to calculate numeric statistics on it.
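
A small base R sketch of this caveat, using made-up values (ages and ages_fct are illustrative names). Calling mean() on the factor would return NA with a warning, and as.numeric() on its own returns the internal level codes rather than the original numbers.

```r
# illustrative numeric values
ages <- c(10, 20, 20, 35)
ages_fct <- factor(ages)

# mean(ages_fct) would return NA with a warning, so it is not run here

as.numeric(ages_fct)                 # level codes, NOT the original values
as.numeric(as.character(ages_fct))   # original values recovered via character

mean(as.numeric(as.character(ages_fct)))
```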

Create column

We use the existing column days_onset_hosp (days from symptom onset to hospital admission) and create a new column delay_cat by classifying each row into one of several categories. We do this with the dplyr function case_when(), which sequentially applies logical criteria (left-side) to each row and returns the corresponding right-side value for the new column delay_cat. Read more about case_when() in Cleaning data and core functions.

linelist <- linelist %>% 
  mutate(delay_cat = case_when(
    # criteria                                   # new value if TRUE
    days_onset_hosp < 2                        ~ "<2 days",
    days_onset_hosp >= 2 & days_onset_hosp < 5 ~ "2-5 days",
    days_onset_hosp >= 5                       ~ ">5 days",
    is.na(days_onset_hosp)                     ~ NA_character_,
    TRUE                                       ~ "Check me"))  

Default value order

As created with case_when(), the new column delay_cat is a categorical column of class Character - not yet a factor. Thus, in a frequency table, we see that the unique values appear in a default alpha-numeric order - an order that does not make much intuitive sense:

table(linelist$delay_cat, useNA = "always")
## 
##  <2 days  >5 days 2-5 days     <NA> 
##     2990      602     2040      256

Likewise, if we make a bar plot, the values also appear in this order on the x-axis (see the ggplot basics page for more on ggplot2 - the most common visualization package in R).

ggplot(data = linelist)+
  geom_bar(mapping = aes(x = delay_cat))

11.2 Convert to factor

To convert a character or numeric column to class factor, you can use any function from the forcats package (many are detailed below). They will convert to class factor and then also perform or allow certain ordering of the levels - for example using fct_relevel() lets you manually specify the level order. The function as_factor() simply converts the class without any further capabilities.

The base R function factor() converts a column to factor and allows you to manually specify the order of the levels, as a character vector to its levels = argument.
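
As a minimal base R sketch, assuming a small made-up vector in place of the linelist column (delay and delay_fct are illustrative names):

```r
# illustrative character values, including one missing value
delay <- c("2-5 days", "<2 days", ">5 days", "<2 days", NA)

# convert to factor and set the level order explicitly
delay_fct <- factor(delay, levels = c("<2 days", "2-5 days", ">5 days"))

levels(delay_fct)                    # levels in the order given
table(delay_fct, useNA = "always")   # counts follow the level order
```

Any value that is not listed in levels = is converted to NA, so the spelling must match the data exactly.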

Below we use mutate() and fct_relevel() to convert the column delay_cat from class character to class factor. The column delay_cat is created in the Preparation section above.

linelist <- linelist %>%
  mutate(delay_cat = fct_relevel(delay_cat))

The unique “values” in this column are now considered “levels” of the factor. The levels have an order, which can be printed with the base R function levels(), or alternatively viewed in a count table via table() from base R or tabyl() from janitor. By default, the order of the levels will be alpha-numeric, as before. Note that NA is not a factor level.

levels(linelist$delay_cat)
## [1] "<2 days"  ">5 days"  "2-5 days"

The function fct_relevel() has the additional utility of allowing you to manually specify the level order. Simply write the level values in order, in quotation marks, separated by commas, as shown below. Note that the spelling must exactly match the values. If you want to create levels that do not exist in the data, use fct_expand() instead.

linelist <- linelist %>%
  mutate(delay_cat = fct_relevel(delay_cat, "<2 days", "2-5 days", ">5 days"))

We can now see that the levels are ordered, as specified in the previous command, in a sensible order.

levels(linelist$delay_cat)
## [1] "<2 days"  "2-5 days" ">5 days"

Now the plot order makes more intuitive sense as well.

ggplot(data = linelist)+
  geom_bar(mapping = aes(x = delay_cat))

11.3 Add or drop levels

Add

If you need to add levels to a factor, you can do this with fct_expand(). Just write the column name followed by the new levels (separated by commas). By tabulating the values, we can see the new levels and the zero counts. You can use table() from base R, or tabyl() from janitor:

linelist %>% 
  mutate(delay_cat = fct_expand(delay_cat, "Not admitted to hospital", "Transfer to other jurisdiction")) %>% 
  tabyl(delay_cat)   # print table
##                       delay_cat    n    percent valid_percent
##                         <2 days 2990 0.50781250     0.5308949
##                        2-5 days 2040 0.34646739     0.3622159
##                         >5 days  602 0.10224185     0.1068892
##        Not admitted to hospital    0 0.00000000     0.0000000
##  Transfer to other jurisdiction    0 0.00000000     0.0000000
##                            <NA>  256 0.04347826            NA

Note: there is a special forcats function to easily add missing values (NA) as a level. See the section on Missing values below.

Drop

If you use fct_drop(), the “unused” levels with zero counts will be dropped from the set of levels. The levels we added above (such as “Not admitted to hospital”) exist as levels, but no rows actually have those values. So they will be dropped by applying fct_drop() to our factor column:

linelist %>% 
  mutate(delay_cat = fct_drop(delay_cat)) %>% 
  tabyl(delay_cat)
##  delay_cat    n    percent valid_percent
##    <2 days 2990 0.50781250     0.5308949
##   2-5 days 2040 0.34646739     0.3622159
##    >5 days  602 0.10224185     0.1068892
##       <NA>  256 0.04347826            NA

11.4 Adjust level order

The package forcats offers useful functions to easily adjust the order of a factor’s levels (after a column has been defined as class factor).

These functions can be applied to a factor column in two contexts:

  1. To the column in the data frame, as usual, so the transformation is available for any subsequent use of the data
  2. Inside of a plot, so that the change is applied only within the plot

Manually

fct_relevel() is used to manually order the factor levels. If used on a non-factor column, the column will first be converted to class factor.

Within the parentheses first provide the factor column name, then provide either:

  • All the levels in the desired order (as a character vector c()), or
  • One level and its corrected placement using the after = argument

Here is an example of redefining the column delay_cat (which is already class Factor) and specifying the desired order of all the levels.

# re-define level order
linelist <- linelist %>% 
  mutate(delay_cat = fct_relevel(delay_cat, c("<2 days", "2-5 days", ">5 days")))

If you only want to move one level, you can specify it to fct_relevel() alone and give a number to the after = argument to indicate where in the order it should be. For example, the command below shifts “<2 days” to the second position:

# re-define level order
linelist %>% 
  mutate(delay_cat = fct_relevel(delay_cat, "<2 days", after = 1)) %>% 
  tabyl(delay_cat)

Within a plot

The forcats commands can be used to set the level order in the data frame, or only within a plot. By “wrapping” the command around the column name within the ggplot() plotting command, you can reverse, relevel, etc., and the transformation will only apply within that plot.

Below, two plots are created with ggplot() (see the ggplot basics page). In the first, the delay_cat column is mapped to the x-axis of the plot, with its default level order as in the data linelist. In the second example it is wrapped within fct_relevel() and the order is changed in the plot.

# Alpha-numeric default order - no adjustment within ggplot
ggplot(data = linelist)+
    geom_bar(mapping = aes(x = delay_cat))

# Factor level order adjusted within ggplot
ggplot(data = linelist)+
  geom_bar(mapping = aes(x = fct_relevel(delay_cat, c("<2 days", "2-5 days", ">5 days"))))

Note that the default x-axis title is now quite complicated - you can overwrite this title with the ggplot2 labs() function.

Reverse

It is rather common that you want to reverse the level order. Simply wrap the factor with fct_rev().

Note that if you want to reverse only a plot legend but not the actual factor levels, you can do that with guides() (see ggplot tips).
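
A minimal sketch of fct_rev(), assuming a small made-up factor in place of the linelist column (delay_fct is an illustrative name):

```r
library(forcats)

# illustrative factor with an explicit level order
delay_fct <- factor(
  c("<2 days", "2-5 days", ">5 days", "<2 days"),
  levels = c("<2 days", "2-5 days", ">5 days"))

levels(delay_fct)            # original order
levels(fct_rev(delay_fct))   # reversed order; the values are unchanged
```

As with the other forcats functions, fct_rev() can be applied within mutate() or wrapped around the column name inside aes().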

By frequency

To order by the frequency with which each value appears in the data, use fct_infreq(). Any missing values (NA) will automatically be included at the end, unless they are converted to an explicit level (see the Missing values section below). You can reverse the order by further wrapping with fct_rev().

This function can be used within a ggplot(), as shown below.

# ordered by frequency
ggplot(data = linelist, aes(x = fct_infreq(delay_cat)))+
  geom_bar()+
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by frequency")

# reversed frequency
ggplot(data = linelist, aes(x = fct_rev(fct_infreq(delay_cat))))+
  geom_bar()+
  labs(x = "Delay onset to admission (days)",
       title = "Reverse of order by frequency")

By appearance

Use fct_inorder() to set the level order to match the order of appearance in the data, starting from the first row. This can be useful if you first carefully arrange() the data in the data frame, and then use this to set the factor order.
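
A minimal sketch of the difference between the default alphabetical order and the order of appearance, using made-up values (hosp_fct is an illustrative name):

```r
library(forcats)

# illustrative values; "Port Hospital" appears first in the data
hosp_fct <- factor(c("Port Hospital", "Central Hospital",
                     "Port Hospital", "Military Hospital"))

levels(hosp_fct)                # default: alphabetical
levels(fct_inorder(hosp_fct))   # order of first appearance in the data
```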

By summary statistic of another column

You can use fct_reorder() to order the levels of one column by a summary statistic of another column. Visually, this can result in pleasing plots where the bars/points ascend or descend steadily across the plot.

In the examples below, the x-axis is delay_cat, and the y-axis is numeric column ct_blood (cycle-threshold value). Box plots show the CT value distribution by delay_cat group. We want to order the box plots in ascending order by the group median CT value.

In the first example below, the default alpha-numeric level order is used. You can see the box plot heights are jumbled and not in any particular order. In the second example, the delay_cat column (mapped to the x-axis) has been wrapped in fct_reorder(), the column ct_blood is given as the second argument, and “median” is given as the third argument (you could also use “max”, “mean”, “min”, etc). Thus, the order of the levels of delay_cat will now reflect the ascending median CT value of each delay_cat group. This is reflected in the second plot - the box plots have been re-arranged to ascend. Note how NA (missing) will appear at the end, unless converted to an explicit level.

# boxplots ordered by original factor levels
ggplot(data = linelist)+
  geom_boxplot(
    aes(x = delay_cat,
        y = ct_blood, 
        fill = delay_cat))+
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by original alpha-numeric levels")+
  theme_classic()+
  theme(legend.position = "none")


# boxplots ordered by median CT value
ggplot(data = linelist)+
  geom_boxplot(
    aes(x = fct_reorder(delay_cat, ct_blood, "median"),
        y = ct_blood,
        fill = delay_cat))+
  labs(x = "Delay onset to admission (days)",
       title = "Ordered by median CT value in group")+
  theme_classic()+
  theme(legend.position = "none")

Note in this example above there are no steps required prior to the ggplot() call - the grouping and calculations are all done internally to the ggplot command.

By “end” value

Use fct_reorder2() for grouped line plots. It orders the levels (and therefore the legend) to align with the vertical ordering of the lines at the “end” of the plot. Technically speaking, it “orders by the y-values associated with the largest x values.”

For example, if you have lines showing case counts by hospital over time, you can apply fct_reorder2() to the color = argument within aes(), such that the vertical order of hospitals appearing in the legend aligns with the order of lines at the terminal end of the plot. Read more in the online documentation.

epidemic_data <- linelist %>%         # begin with the linelist   
    filter(date_onset < as.Date("2014-09-21")) %>%    # cut-off date, for visual clarity
    count(                                            # get case counts per week and by hospital
      epiweek = lubridate::floor_date(date_onset, "week"),  
      hospital                                            
    ) 
  
ggplot(data = epidemic_data)+                       # start plot
  geom_line(                                        # make lines
    aes(
      x = epiweek,                                  # x-axis epiweek
      y = n,                                        # height is number of cases per week
      color = fct_reorder2(hospital, epiweek, n)))+ # data grouped and colored by hospital, with factor order by height at end of plot
  labs(title = "Factor levels (and legend display) by line height at end of plot",
       color = "Hospital")                          # change legend title

11.5 Missing values

If you have NA values in your factor column, you can easily convert them to a named level such as “Missing” with fct_explicit_na(). By default, the NA values are converted to a level named “(Missing)”, placed at the end of the level order. You can adjust the level name with the argument na_level =. In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favour of fct_na_value_to_level(), which takes the level name in its level = argument.

Below, this operation is performed on the column delay_cat and a table is printed with tabyl() with NA converted to “Missing delay”.

linelist %>% 
  mutate(delay_cat = fct_explicit_na(delay_cat, na_level = "Missing delay")) %>% 
  tabyl(delay_cat)
##      delay_cat    n    percent
##       2-5 days 2040 0.34646739
##        <2 days 2990 0.50781250
##        >5 days  602 0.10224185
##  Missing delay  256 0.04347826

11.6 Combine levels

Manually

You can adjust the level displays manually with fct_recode(). This is like the dplyr function recode() (see Cleaning data and core functions), but it allows the creation of new factor levels. If you use the simple recode() on a factor, new re-coded values will be rejected unless they have already been set as permissible levels.

This tool can also be used to “combine” levels, by assigning multiple levels the same re-coded value. Just be careful to not lose information! Consider doing these combining steps in a new column (not over-writing the existing column).

fct_recode() has a different syntax than recode(). recode() uses OLD = NEW, whereas fct_recode() uses NEW = OLD.

The current levels of delay_cat are:

levels(linelist$delay_cat)
## [1] "<2 days"  "2-5 days" ">5 days"

The new levels are created using syntax fct_recode(column, "new" = "old", "new" = "old", "new" = "old") and printed:

linelist %>% 
  mutate(delay_cat = fct_recode(
    delay_cat,
    "Less than 2 days" = "<2 days",
    "2 to 5 days"      = "2-5 days",
    "More than 5 days" = ">5 days")) %>% 
  tabyl(delay_cat)
##         delay_cat    n    percent valid_percent
##  Less than 2 days 2990 0.50781250     0.5308949
##       2 to 5 days 2040 0.34646739     0.3622159
##  More than 5 days  602 0.10224185     0.1068892
##              <NA>  256 0.04347826            NA

Here they are manually combined with fct_recode(). Note there is no error raised at the creation of a new level “Less than 5 days”.

linelist %>% 
  mutate(delay_cat = fct_recode(
    delay_cat,
    "Less than 5 days" = "<2 days",
    "Less than 5 days" = "2-5 days",
    "More than 5 days" = ">5 days")) %>% 
  tabyl(delay_cat)
##         delay_cat    n    percent valid_percent
##  Less than 5 days 5030 0.85427989     0.8931108
##  More than 5 days  602 0.10224185     0.1068892
##              <NA>  256 0.04347826            NA

Reduce into “Other”

You can use fct_other() to manually assign factor levels to an “Other” level. Below, all levels in the column hospital, aside from “Port Hospital” and “Central Hospital”, are combined into “Other”. You can provide a vector to either keep =, or drop =. You can change the display of the “Other” level with other_level =.

linelist %>%    
  mutate(hospital = fct_other(                      # adjust levels
    hospital,
    keep = c("Port Hospital", "Central Hospital"),  # keep these separate
    other_level = "Other Hospital")) %>%            # All others as "Other Hospital"
  tabyl(hospital)                                   # print table
##          hospital    n    percent
##  Central Hospital  454 0.07710598
##     Port Hospital 1762 0.29925272
##    Other Hospital 3672 0.62364130

Reduce by frequency

You can combine the least-frequent factor levels automatically using fct_lump().

To “lump” together many low-frequency levels into an “Other” group, do one of the following:

  • Set n = as the number of groups you want to keep. The n most-frequent levels will be kept, and all others will combine into “Other”.
  • Set prop = as the threshold proportion of rows. Levels that appear more often than this proportion will be kept, and all other values will combine into “Other”.

You can change the display of the “Other” level with other_level =. Below, all but the two most-frequent hospitals are combined into “Other Hospital”.

linelist %>%    
  mutate(hospital = fct_lump(                      # adjust levels
    hospital,
    n = 2,                                          # keep top 2 levels
    other_level = "Other Hospital")) %>%            # all others as "Other Hospital"
  tabyl(hospital)                                   # print table
##        hospital    n   percent
##         Missing 1469 0.2494905
##   Port Hospital 1762 0.2992527
##  Other Hospital 2657 0.4512568
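
The prop = option can be sketched in the same way. The hospital values below are made up for illustration (hosp and hosp_lumped are not handbook objects), chosen so that one level accounts for 60% of rows and the other three fall below the 25% threshold.

```r
library(forcats)

# illustrative factor: one common level and three rare ones
hosp <- factor(c(rep("Port", 6), rep("Central", 2), "Military", "St. Mark's"))

# keep levels that make up more than 25% of rows; combine the rest
hosp_lumped <- fct_lump(hosp, prop = 0.25, other_level = "Other Hospital")

table(hosp_lumped)
```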

Show all levels

One benefit of using factors is to standardise the appearance of plot legends and tables, regardless of which values are actually present in a dataset.

If you are preparing many figures (e.g. for multiple jurisdictions) you will want the legends and tables to appear identically even with varying levels of data completion or data composition.

In plots

In a ggplot() figure, simply add the argument drop = FALSE in the relevant scale_xxxx() function. All factor levels will be displayed, regardless of whether they are present in the data. If your factor column levels are displayed using fill =, then in scale_fill_discrete() you include drop = FALSE, as shown below. If your levels are mapped to x = (the x-axis), color =, or size =, you would instead provide drop = FALSE to scale_x_discrete(), scale_color_discrete(), or scale_size_discrete() accordingly.

This example is a stacked bar plot of age category, by hospital. Adding scale_fill_discrete(drop = FALSE) ensures that all age groups appear in the legend, even if not present in the data.

ggplot(data = linelist)+
  geom_bar(mapping = aes(x = hospital, fill = age_cat)) +
  scale_fill_discrete(drop = FALSE)+                        # show all age groups in the legend, even those not present
  labs(
    title = "All age groups will appear in legend, even if not present in data")

In tables

Both the base R table() and tabyl() from janitor will show all factor levels (even unused levels).

If you use count() from dplyr to make a table, add the argument .drop = FALSE to include counts for all factor levels, even those unused. If you use summarise(), set .drop = FALSE in the preceding group_by() instead.
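
A minimal sketch of the effect of .drop = FALSE, assuming a small made-up data frame in which the level “15+” is defined but never observed (df, counts_default, counts_all, and counts_grouped are illustrative names):

```r
library(dplyr)

# illustrative data: "15+" is a valid level with no rows
df <- data.frame(
  age_cat = factor(c("0-4", "5-14", "0-4"),
                   levels = c("0-4", "5-14", "15+")))

counts_default <- count(df, age_cat)                  # unused level is omitted
counts_all     <- count(df, age_cat, .drop = FALSE)   # unused level kept, n = 0

# equivalent approach with group_by() and summarise()
counts_grouped <- df %>% 
  group_by(age_cat, .drop = FALSE) %>% 
  summarise(n = n())

counts_all
```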

Read more in the Descriptive tables page, or at the scale_discrete documentation, or the count() documentation. You can see another example in the Contact tracing page.

11.7 Epiweeks

Please see the extensive discussion of how to create epidemiological weeks in the Grouping data page.
Please also see the Working with dates page for tips on how to create and format epidemiological weeks.

Epiweeks in a plot

If your goal is to create epiweeks to display in a plot, you can do this simply with lubridate’s floor_date(), as explained in the Grouping data page. The values returned will be of class Date with format YYYY-MM-DD. If you use this column in a plot, the dates will naturally order correctly, and you do not need to worry about levels or converting to class Factor. See the ggplot() histogram of onset dates below.

In this approach, you can adjust the display of the dates on an axis with scale_x_date(). See the page on Epidemic curves for more information. You can specify a “strptime” display format to the date_labels = argument of scale_x_date(). These formats use “%” placeholders and are covered in the Working with dates page. Use “%Y” to represent a 4-digit year, and either “%W” or “%U” to represent the week number (Monday or Sunday weeks respectively).

linelist %>% 
  mutate(epiweek_date = floor_date(date_onset, "week")) %>%  # create week column
  ggplot()+                                                  # begin ggplot
  geom_histogram(mapping = aes(x = epiweek_date))+           # histogram of date of onset
  scale_x_date(date_labels = "%Y-W%W")                       # adjust display of dates to be YYYY-Www
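
The difference between “%W” and “%U” can be seen with format() from base R on a single date. This is only a sketch; the date below is chosen because it falls on a Sunday, where the two week definitions disagree.

```r
# illustrative date: 21 September 2014 was a Sunday
d <- as.Date("2014-09-21")

format(d, "%Y-W%W")   # Monday-based weeks: the Sunday closes week 37
format(d, "%Y-W%U")   # Sunday-based weeks: the Sunday opens week 38
```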

Epiweeks in the data

However, if your purpose in factoring is not to plot, you can approach this one of two ways:

  1. For fine control over the display, convert the lubridate epiweek column (YYYY-MM-DD) to the desired display format (YYYY-Www) within the data frame itself, and then convert it to class Factor.

First, use format() from base R to convert the date display from YYYY-MM-DD to YYYY-Www display (see the Working with dates page). In this process the class will be converted to character. Then, convert from character to class Factor with factor().

linelist <- linelist %>% 
  mutate(epiweek_date = floor_date(date_onset, "week"),       # create epiweeks (YYYY-MM-DD)
         epiweek_formatted = format(epiweek_date, "%Y-W%W"),  # Convert to display (YYYY-Www)
         epiweek_formatted = factor(epiweek_formatted))       # Convert to factor

# Display levels
levels(linelist$epiweek_formatted)
##  [1] "2014-W13" "2014-W14" "2014-W15" "2014-W16" "2014-W17" "2014-W18" "2014-W19" "2014-W20" "2014-W21" "2014-W22" "2014-W23" "2014-W24" "2014-W25" "2014-W26"
## [15] "2014-W27" "2014-W28" "2014-W29" "2014-W30" "2014-W31" "2014-W32" "2014-W33" "2014-W34" "2014-W35" "2014-W36" "2014-W37" "2014-W38" "2014-W39" "2014-W40"
## [29] "2014-W41" "2014-W42" "2014-W43" "2014-W44" "2014-W45" "2014-W46" "2014-W47" "2014-W48" "2014-W49" "2014-W50" "2014-W51" "2015-W00" "2015-W01" "2015-W02"
## [43] "2015-W03" "2015-W04" "2015-W05" "2015-W06" "2015-W07" "2015-W08" "2015-W09" "2015-W10" "2015-W11" "2015-W12" "2015-W13" "2015-W14" "2015-W15" "2015-W16"

DANGER: If you place the weeks ahead of the years (“Www-YYYY”) (“%W-%Y”), the default alpha-numeric level ordering will be incorrect (e.g. 01-2015 will be before 35-2014). You would need to manually adjust the order, which would be a long painful process.
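
A short base R sketch of why the year should come first, using two made-up week labels (the object names are illustrative):

```r
# the same two weeks written in both styles
weeks_year_first <- c("2014-W35", "2015-W01")
weeks_week_first <- c("W35-2014", "W01-2015")

levels(factor(weeks_year_first))   # alpha-numeric order is also chronological
levels(factor(weeks_week_first))   # week 01 of 2015 sorts before week 35 of 2014
```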

  2. For fast default display, use the aweek package and its function date2week(). You can set the week_start = day, and if you set factor = TRUE then the output column is an ordered factor. As a bonus, the factor includes levels for all possible weeks in the span - even if there are no cases that week.

df <- linelist %>% 
  mutate(epiweek = date2week(date_onset, week_start = "Monday", factor = TRUE))

levels(df$epiweek)

See the Working with dates page for more information about aweek. It also offers the reverse function week2date().

11.8 Resources

R for Data Science page on factors
aweek package vignette

12 Pivoting data

When managing data, pivoting can be understood to refer to one of two processes:

  1. The creation of pivot tables, which are tables of statistics that summarise the data of a more extensive table
  2. The conversion of a table from long to wide format, or vice versa.

In this page, we will focus on the latter definition. The former is a crucial step in data analysis, and is covered elsewhere in the Grouping data and Descriptive tables pages.

This page discusses the formats of data. It is useful to be aware of the idea of “tidy data”, in which each variable has its own column, each observation has its own row, and each value has its own cell. More about this topic can be found at this online chapter in R for Data Science.

12.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

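pacman::p_load(
  rio,          # File import
  here,         # File locator
  janitor,      # adding total rows and columns to tables
  knitr,        # kable() tables
  kableExtra,   # styling of kable() tables
  tidyverse)    # data management + ggplot2 graphics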

Import data

Malaria count data

In this page, we will use a fictional dataset of daily malaria cases, by facility and age group. If you want to follow along, click here to download (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# Import data
count_data <- import("malaria_facility_count_data.rds")

The first 50 rows are displayed below.

Linelist case data

In the later part of this page, we will also use the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import your dataset
linelist <- import("linelist_cleaned.rds")

12.2 Wide-to-long

“Wide” format

Data are often entered and stored in a “wide” format - where a subject’s characteristics or responses are stored in a single row. While this may be useful for presentation, it is not ideal for some types of analysis.

Let us take the count_data dataset imported in the Preparation section above as an example. You can see that each row represents a “facility-day”. The actual case counts (the right-most columns) are stored in a “wide” format such that the information for every age group on a given facility-day is stored in a single row.

Each observation in this dataset refers to the malaria counts at one of 65 facilities on a given date. The first and last dates can be found by running min(count_data$data_date) and max(count_data$data_date). These facilities are located in one Province (North) and four Districts (Spring, Bolo, Dingo, and Barnard). The dataset provides the overall counts of malaria, as well as age-specific counts in each of three age groups - 0-4 years, 5-14 years, and 15 years and older.

“Wide” data like this do not adhere to “tidy data” standards, because the column headers do not actually represent “variables” - they represent values of a hypothetical “age group” variable.

This format can be useful for presenting the information in a table, or for entering data (e.g. in Excel) from case report forms. However, in the analysis stage, these data typically should be transformed to a “longer” format more aligned with “tidy data” standards. The plotting R package ggplot2 in particular works best when data are in a “long” format.

Visualising the total malaria counts over time poses no difficulty with the data in its current format:

ggplot(count_data) +
  geom_col(aes(x = data_date, y = malaria_tot), width = 1)

However, what if we wanted to display the relative contributions of each age group to this total count? In this case, we need to ensure that the variable of interest (age group), appears in the dataset in a single column that can be passed to {ggplot2}’s “mapping aesthetics” aes() argument.

pivot_longer()

The tidyr function pivot_longer() makes data “longer”. tidyr is part of the tidyverse of R packages.

It accepts a range of columns to transform (specified to cols =). Therefore, it can operate on only a part of a dataset. This is useful for the malaria data, as we only want to pivot the case count columns.

In this process, you will end up with two “new” columns - one with the categories (the former column names), and one with the corresponding values (e.g. case counts). You can accept the default names for these new columns, or you can specify your own to names_to = and values_to = respectively.

Let’s see pivot_longer() in action…

Standard pivoting

We want to use tidyr’s pivot_longer() function to convert the “wide” data to a “long” format. Specifically, we want to convert the four numeric columns with data on malaria counts into two new columns: one which holds the age groups and one which holds the corresponding values.

df_long <- count_data %>% 
  pivot_longer(
    cols = c(`malaria_rdt_0-4`, `malaria_rdt_5-14`, `malaria_rdt_15`, `malaria_tot`)
  )

df_long

Notice that the newly created data frame (df_long) has more rows (12,152 vs 3,038); it has become longer. In fact, it is precisely four times as long, because each row in the original dataset now represents four rows in df_long, one for each of the malaria count observations (0-4y, 5-14y, 15y+, and total).

In addition to becoming longer, the new dataset has fewer columns (8 vs 10), as the data previously stored in four columns (those beginning with the prefix malaria_) is now stored in two.
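
The row and column arithmetic above can be checked on a small example. The sketch below assumes a toy data frame (called wide here, a hypothetical name) holding the counts of the first three facilities in the malaria data, with a single identifier column rather than the full set of columns.

```r
library(tibble)
library(tidyr)

# Toy version of the malaria data: one identifier column + four count columns
wide <- tibble(
  facility           = c("Facility 1", "Facility 2", "Facility 3"),
  `malaria_rdt_0-4`  = c(11, 11, 8),
  `malaria_rdt_5-14` = c(12, 10, 5),
  malaria_rdt_15     = c(23, 5, 5),
  malaria_tot        = c(46, 26, 18)
)

# Pivot the four count columns into the default "name" and "value" columns
long <- pivot_longer(wide, cols = starts_with("malaria_"))

dim(wide)   # 3 rows, 5 columns
dim(long)   # 12 rows (3 x 4), 3 columns (5 - 4 pivoted + 2 new)
```

In general, pivoting k columns multiplies the number of rows by k and changes the number of columns by 2 - k.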

Since the names of these four columns all begin with the prefix malaria_, we could have made use of the handy “tidyselect” function starts_with() to achieve the same result (see the page Cleaning data and core functions for more of these helper functions).

# provide column with a tidyselect helper function
count_data %>% 
  pivot_longer(
    cols = starts_with("malaria_")
  )
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid name             value
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>            <int>
##  1 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_rdt_0-4     11
##  2 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_rdt_5-14    12
##  3 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_rdt_15      23
##  4 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_tot         46
##  5 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_rdt_0-4     11
##  6 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_rdt_5-14    10
##  7 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_rdt_15       5
##  8 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_tot         26
##  9 Facility 3    2020-08-11 2020-08-12     North    Dingo        3 malaria_rdt_0-4      8
## 10 Facility 3    2020-08-11 2020-08-12     North    Dingo        3 malaria_rdt_5-14     5
## # ... with 12,142 more rows

or by position:

# provide columns by position
count_data %>% 
  pivot_longer(
    cols = 6:9
  )

or by named range:

# provide range of consecutive columns
count_data %>% 
  pivot_longer(
    cols = `malaria_rdt_0-4`:malaria_tot
  )

These two new columns are given the default names of name and value, but we can override these defaults to provide more meaningful names, which can help remember what is stored within, using the names_to and values_to arguments. Let’s use the names age_group and counts:

df_long <- 
  count_data %>% 
  pivot_longer(
    cols = starts_with("malaria_"),
    names_to = "age_group",
    values_to = "counts"
  )

df_long
## # A tibble: 12,152 x 8
##    location_name data_date  submitted_date Province District newid age_group        counts
##    <chr>         <date>     <date>         <chr>    <chr>    <int> <chr>             <int>
##  1 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_rdt_0-4      11
##  2 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_rdt_5-14     12
##  3 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_rdt_15       23
##  4 Facility 1    2020-08-11 2020-08-12     North    Spring       1 malaria_tot          46
##  5 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_rdt_0-4      11
##  6 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_rdt_5-14     10
##  7 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_rdt_15        5
##  8 Facility 2    2020-08-11 2020-08-12     North    Bolo         2 malaria_tot          26
##  9 Facility 3    2020-08-11 2020-08-12     North    Dingo        3 malaria_rdt_0-4       8
## 10 Facility 3    2020-08-11 2020-08-12     North    Dingo        3 malaria_rdt_5-14      5
## # ... with 12,142 more rows

We can now pass this new dataset to {ggplot2}, and map the new column counts to the y-axis and the new column age_group to the fill = argument (the column internal color). This will display the malaria counts in a stacked bar chart, by age group:

ggplot(data = df_long) +
  geom_col(
    mapping = aes(x = data_date, y = counts, fill = age_group),
    width = 1
  )

Examine this new plot, and compare it with the plot we created earlier - what has gone wrong?

We have encountered a common problem when wrangling surveillance data - we have also included the total counts from the malaria_tot column, so the magnitude of each bar in the plot is twice as high as it should be.

We can handle this in a number of ways. We could simply filter these totals from the dataset before we pass it to ggplot():

df_long %>% 
  filter(age_group != "malaria_tot") %>% 
  ggplot() +
  geom_col(
    aes(x = data_date, y = counts, fill = age_group),
    width = 1
  )

Alternatively, we could have excluded this variable when we ran pivot_longer(), thereby maintaining it in the dataset as a separate variable. See how its values “expand” to fill the new rows.

count_data %>% 
  pivot_longer(
    cols = `malaria_rdt_0-4`:malaria_rdt_15,   # does not include the totals column
    names_to = "age_group",
    values_to = "counts"
  )
## # A tibble: 9,114 x 9
##    location_name data_date  submitted_date Province District malaria_tot newid age_group        counts
##    <chr>         <date>     <date>         <chr>    <chr>          <int> <int> <chr>             <int>
##  1 Facility 1    2020-08-11 2020-08-12     North    Spring            46     1 malaria_rdt_0-4      11
##  2 Facility 1    2020-08-11 2020-08-12     North    Spring            46     1 malaria_rdt_5-14     12
##  3 Facility 1    2020-08-11 2020-08-12     North    Spring            46     1 malaria_rdt_15       23
##  4 Facility 2    2020-08-11 2020-08-12     North    Bolo              26     2 malaria_rdt_0-4      11
##  5 Facility 2    2020-08-11 2020-08-12     North    Bolo              26     2 malaria_rdt_5-14     10
##  6 Facility 2    2020-08-11 2020-08-12     North    Bolo              26     2 malaria_rdt_15        5
##  7 Facility 3    2020-08-11 2020-08-12     North    Dingo             18     3 malaria_rdt_0-4       8
##  8 Facility 3    2020-08-11 2020-08-12     North    Dingo             18     3 malaria_rdt_5-14      5
##  9 Facility 3    2020-08-11 2020-08-12     North    Dingo             18     3 malaria_rdt_15        5
## 10 Facility 4    2020-08-11 2020-08-12     North    Bolo              49     4 malaria_rdt_0-4      16
## # ... with 9,104 more rows
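
As an optional refinement, the pivot_longer() argument names_prefix = can strip a common prefix from the new category values, and the retained totals column gives a convenient consistency check. The following is only a sketch on a toy data frame (wide, a hypothetical name) built from the first three facilities shown above, not on the full count_data.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Toy data: counts for the first three facilities (identifier column simplified)
wide <- tibble(
  facility           = c("Facility 1", "Facility 2", "Facility 3"),
  `malaria_rdt_0-4`  = c(11, 11, 8),
  `malaria_rdt_5-14` = c(12, 10, 5),
  malaria_rdt_15     = c(23, 5, 5),
  malaria_tot        = c(46, 26, 18)
)

long <- wide %>% 
  pivot_longer(
    cols = starts_with("malaria_rdt_"),   # age-specific columns only
    names_to = "age_group",
    names_prefix = "malaria_rdt_",        # drop the prefix: "0-4", "5-14", "15"
    values_to = "counts"
  )

# Check that the age-specific counts add up to the reported total
check <- long %>% 
  group_by(facility) %>% 
  summarise(
    sum_counts  = sum(counts),
    malaria_tot = first(malaria_tot)
  )

all(check$sum_counts == check$malaria_tot)
```

Shorter category labels such as "0-4" tend to read better in plot legends than the full column names.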

Pivoting data of multiple classes

The above example works well in situations in which all the columns you want to “pivot longer” are of the same class (character, numeric, logical…).

However, there will be many cases when, as a field epidemiologist, you will be working with data that were prepared by non-specialists and that follow their own non-standard logic - as Hadley Wickham noted (referencing Tolstoy) in his seminal article on Tidy Data principles: “Like families, tidy datasets are all alike but every messy dataset is messy in its own way.”

One particularly common problem you will encounter will be the need to pivot columns that contain different classes of data. This pivot will result in storing these different data types in a single column, which is not a good situation. There are various approaches one can take to separate out the mess this creates, but there is an important step you can take using pivot_longer() to avoid creating such a situation yourself.

Take a situation in which there have been a series of observations at different time steps for each of three items A, B and C. Examples of such items could be individuals (e.g. contacts of an Ebola case being traced each day for 21 days) or remote village health posts being monitored once per year to ensure they are still functional. Let’s use the contact tracing example. Imagine that the data are stored as follows:
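
For readers following along, a data frame of this shape can be recreated as in the sketch below. The values are taken from the printed outputs later in this section; the use of tribble() and the choice to store the dates as text (which matches the <chr> columns in those outputs) are assumptions about how the example was built.

```r
library(tibble)

# Contact-tracing follow-up: one row per contact, one date/status pair per visit.
# Dates are stored as text here, consistent with the printed outputs in this section.
df <- tribble(
  ~id, ~obs1_date,   ~obs1_status, ~obs2_date,   ~obs2_status, ~obs3_date,   ~obs3_status,
  "A", "2021-04-23", "Healthy",    "2021-04-24", "Healthy",    "2021-04-25", "Unwell",
  "B", "2021-04-23", "Healthy",    "2021-04-24", "Healthy",    "2021-04-25", "Healthy",
  "C", "2021-04-23", "Missing",    "2021-04-24", "Healthy",    "2021-04-25", "Healthy"
)

df
```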

As can be seen, the data are a bit complicated. Each row stores information about one item, but with the time series running further and further away to the right as time progresses. Moreover, the column classes alternate between date and character values.

One particularly bad example of this encountered by this author involved cholera surveillance data, in which 8 new columns of observations were added each day over the course of 4 years. Simply opening the Excel file in which these data were stored took more than 10 minutes on my laptop!

In order to work with these data, we need to transform the data frame to long format, but keeping the separation between a date column and a character (status) column, for each observation for each item. If we don’t, we might end up with a mixture of variable types in a single column (a very big “no-no” when it comes to data management and tidy data):

df %>% 
  pivot_longer(
    cols = -id,
    names_to = c("observation")
  )
## # A tibble: 18 x 3
##    id    observation value     
##    <chr> <chr>       <chr>     
##  1 A     obs1_date   2021-04-23
##  2 A     obs1_status Healthy   
##  3 A     obs2_date   2021-04-24
##  4 A     obs2_status Healthy   
##  5 A     obs3_date   2021-04-25
##  6 A     obs3_status Unwell    
##  7 B     obs1_date   2021-04-23
##  8 B     obs1_status Healthy   
##  9 B     obs2_date   2021-04-24
## 10 B     obs2_status Healthy   
## 11 B     obs3_date   2021-04-25
## 12 B     obs3_status Healthy   
## 13 C     obs1_date   2021-04-23
## 14 C     obs1_status Missing   
## 15 C     obs2_date   2021-04-24
## 16 C     obs2_status Healthy   
## 17 C     obs3_date   2021-04-25
## 18 C     obs3_status Healthy

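Above, our pivot has merged dates and statuses into a single value column. Because a column can hold only one class, everything is stored as class character, and the utility of the dates is lost. (In this example the dates were already stored as text; with true Date columns, pivot_longer() would stop with an error rather than convert them silently.)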

To prevent this situation, we can take advantage of the syntax structure of the original column names. There is a common naming structure, with the observation number, an underscore, and then either “status” or “date”. We can leverage this syntax to keep these two data types in separate columns after the pivot.

We do this by:

  • Providing a character vector to the names_to = argument, with the second item being ".value". This special term indicates that part of each original column name (here “date” or “status”) will become the name of a new column holding the corresponding values.
  • Providing the “splitting” character to the names_sep = argument. In this case, it is the underscore "_".

Thus, the naming and split of new columns is based around the underscore in the existing variable names.

df_long <- 
  df %>% 
  pivot_longer(
    cols = -id,
    names_to = c("observation", ".value"),
    names_sep = "_"
  )

df_long
## # A tibble: 9 x 4
##   id    observation date       status 
##   <chr> <chr>       <chr>      <chr>  
## 1 A     obs1        2021-04-23 Healthy
## 2 A     obs2        2021-04-24 Healthy
## 3 A     obs3        2021-04-25 Unwell 
## 4 B     obs1        2021-04-23 Healthy
## 5 B     obs2        2021-04-24 Healthy
## 6 B     obs3        2021-04-25 Healthy
## 7 C     obs1        2021-04-23 Missing
## 8 C     obs2        2021-04-24 Healthy
## 9 C     obs3        2021-04-25 Healthy

Finishing touches:

Note that the date column is currently in character class - we can easily convert this into its proper date class using the mutate() and as_date() functions described in the Working with dates page.

We may also want to convert the observation column to a numeric format by dropping the “obs” prefix and converting to numeric. We can do this with str_remove_all() from the stringr package (see the Characters and strings page).

df_long <- 
  df_long %>% 
  mutate(
    date = date %>% lubridate::as_date(),
    observation = 
      observation %>% 
      str_remove_all("obs") %>% 
      as.numeric()
  )

df_long
## # A tibble: 9 x 4
##   id    observation date       status 
##   <chr>       <dbl> <date>     <chr>  
## 1 A               1 2021-04-23 Healthy
## 2 A               2 2021-04-24 Healthy
## 3 A               3 2021-04-25 Unwell 
## 4 B               1 2021-04-23 Healthy
## 5 B               2 2021-04-24 Healthy
## 6 B               3 2021-04-25 Healthy
## 7 C               1 2021-04-23 Missing
## 8 C               2 2021-04-24 Healthy
## 9 C               3 2021-04-25 Healthy

And now, we can start to work with the data in this format, e.g. by plotting a descriptive heat tile:

ggplot(data = df_long, mapping = aes(x = date, y = id, fill = status)) +
  geom_tile(colour = "black") +
  scale_fill_manual(
    values = 
      c("Healthy" = "lightgreen", 
        "Unwell" = "red", 
        "Missing" = "orange")
  )

12.3 Long-to-wide

In some instances, we may wish to convert a dataset to a wider format. For this, we can use the pivot_wider() function.

A typical use-case is when we want to transform the results of an analysis into a format which is more digestible for the reader (such as a table for presentation). Usually, this involves transforming a dataset in which information for one subject is spread over multiple rows into a format in which that information is stored in a single row.

Data

For this section of the page, we will use the case linelist (see the Preparation section), which contains one row per case.

Here are the first 50 rows:

Suppose that we want to know the counts of individuals in the different age groups, by gender:

df_wide <- 
  linelist %>% 
  count(age_cat, gender)

df_wide
##    age_cat gender   n
## 1      0-4      f 640
## 2      0-4      m 416
## 3      0-4   <NA>  39
## 4      5-9      f 641
## 5      5-9      m 412
## 6      5-9   <NA>  42
## 7    10-14      f 518
## 8    10-14      m 383
## 9    10-14   <NA>  40
## 10   15-19      f 359
## 11   15-19      m 364
## 12   15-19   <NA>  20
## 13   20-29      f 468
## 14   20-29      m 575
## 15   20-29   <NA>  30
## 16   30-49      f 179
## 17   30-49      m 557
## 18   30-49   <NA>  18
## 19   50-69      f   2
## 20   50-69      m  91
## 21   50-69   <NA>   2
## 22     70+      m   5
## 23     70+   <NA>   1
## 24    <NA>   <NA>  86

This gives us a long dataset that is great for producing visualisations in ggplot2, but not ideal for presentation in a table:

ggplot(df_wide) +
  geom_col(aes(x = age_cat, y = n, fill = gender))

Pivot wider

Therefore, we can use pivot_wider() to transform the data into a better format for inclusion as tables in our reports.

The argument names_from specifies the column from which to generate the new column names, while the argument values_from specifies the column from which to take the values to populate the cells. The argument id_cols = is optional, but can be provided a vector of column names that should not be pivoted, and will thus identify each row.

table_wide <- 
  df_wide %>% 
  pivot_wider(
    id_cols = age_cat,
    names_from = gender,
    values_from = n
  )

table_wide
## # A tibble: 9 x 4
##   age_cat     f     m  `NA`
##   <fct>   <int> <int> <int>
## 1 0-4       640   416    39
## 2 5-9       641   412    42
## 3 10-14     518   383    40
## 4 15-19     359   364    20
## 5 20-29     468   575    30
## 6 30-49     179   557    18
## 7 50-69       2    91     2
## 8 70+        NA     5     1
## 9 <NA>       NA    NA    86

This table is much more reader-friendly, and therefore better for inclusion in our reports. You can convert it into a pretty table with several packages including flextable and knitr. This process is elaborated in the page Tables for presentation.

table_wide %>% 
  janitor::adorn_totals(c("row", "col")) %>% # adds row and column totals
  knitr::kable() %>% 
  kableExtra::row_spec(row = 10, bold = TRUE) %>% 
  kableExtra::column_spec(column = 5, bold = TRUE) 
age_cat       f      m     NA   Total
0-4         640    416     39    1095
5-9         641    412     42    1095
10-14       518    383     40     941
15-19       359    364     20     743
20-29       468    575     30    1073
30-49       179    557     18     754
50-69         2     91      2      95
70+          NA      5      1       6
NA           NA     NA     86      86
Total      2807   2803    278    5888
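
The NA cells in the wide table above (e.g. females aged 70+) arise because count() produced no row for combinations with zero cases. If an explicit zero is preferred, the values_fill = argument of pivot_wider() can supply it. The sketch below uses a toy three-row extract of the counts (counts_long, a hypothetical name); whether zero is the right replacement depends on whether a missing combination truly means "no cases" in your data.

```r
library(tibble)
library(tidyr)

# Toy extract of the long counts: there is no row for females aged 70+
counts_long <- tribble(
  ~age_cat, ~gender, ~n,
  "50-69",  "f",      2,
  "50-69",  "m",     91,
  "70+",    "m",      5
)

counts_wide <- counts_long %>% 
  pivot_wider(
    id_cols = age_cat,
    names_from = gender,
    values_from = n,
    values_fill = 0      # fill absent combinations with 0 instead of NA
  )

counts_wide
```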

12.4 Fill

In some situations after a pivot, and more commonly after a bind, we are left with gaps in some cells that we would like to fill.

Data

For example, take two datasets, each with observations for the measurement number, the name of the facility, and the case count at that time. However, the second dataset also has a variable Year.

df1 <- 
  tibble::tribble(
       ~Measurement, ~Facility, ~Cases,
                  1,  "Hosp 1",     66,
                  2,  "Hosp 1",     26,
                  3,  "Hosp 1",      8,
                  1,  "Hosp 2",     71,
                  2,  "Hosp 2",     62,
                  3,  "Hosp 2",     70,
                  1,  "Hosp 3",     47,
                  2,  "Hosp 3",     70,
                  3,  "Hosp 3",     38,
       )

df1 
## # A tibble: 9 x 3
##   Measurement Facility Cases
##         <dbl> <chr>    <dbl>
## 1           1 Hosp 1      66
## 2           2 Hosp 1      26
## 3           3 Hosp 1       8
## 4           1 Hosp 2      71
## 5           2 Hosp 2      62
## 6           3 Hosp 2      70
## 7           1 Hosp 3      47
## 8           2 Hosp 3      70
## 9           3 Hosp 3      38
df2 <- 
  tibble::tribble(
    ~Year, ~Measurement, ~Facility, ~Cases,
     2000,            1,  "Hosp 4",     82,
     2001,            2,  "Hosp 4",     87,
     2002,            3,  "Hosp 4",     46
  )

df2
## # A tibble: 3 x 4
##    Year Measurement Facility Cases
##   <dbl>       <dbl> <chr>    <dbl>
## 1  2000           1 Hosp 4      82
## 2  2001           2 Hosp 4      87
## 3  2002           3 Hosp 4      46

When we perform a bind_rows() to join the two datasets together, the Year variable is filled with NA for those rows where there was no prior information (i.e. the first dataset):

df_combined <- 
  bind_rows(df1, df2) %>% 
  arrange(Measurement, Facility)

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 1      66    NA
##  2           1 Hosp 2      71    NA
##  3           1 Hosp 3      47    NA
##  4           1 Hosp 4      82  2000
##  5           2 Hosp 1      26    NA
##  6           2 Hosp 2      62    NA
##  7           2 Hosp 3      70    NA
##  8           2 Hosp 4      87  2001
##  9           3 Hosp 1       8    NA
## 10           3 Hosp 2      70    NA
## 11           3 Hosp 3      38    NA
## 12           3 Hosp 4      46  2002

fill()

In this case, Year is a useful variable to include, particularly if we want to explore trends over time. Therefore, we use fill() to fill in those empty cells, by specifying the column to fill and the direction (in this case up):

df_combined %>% 
  fill(Year, .direction = "up")
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 1      66  2000
##  2           1 Hosp 2      71  2000
##  3           1 Hosp 3      47  2000
##  4           1 Hosp 4      82  2000
##  5           2 Hosp 1      26  2001
##  6           2 Hosp 2      62  2001
##  7           2 Hosp 3      70  2001
##  8           2 Hosp 4      87  2001
##  9           3 Hosp 1       8  2002
## 10           3 Hosp 2      70  2002
## 11           3 Hosp 3      38  2002
## 12           3 Hosp 4      46  2002

Alternatively, we can rearrange the data so that we would need to fill in a downward direction:

df_combined <- 
  df_combined %>% 
  arrange(Measurement, desc(Facility))

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 4      82  2000
##  2           1 Hosp 3      47    NA
##  3           1 Hosp 2      71    NA
##  4           1 Hosp 1      66    NA
##  5           2 Hosp 4      87  2001
##  6           2 Hosp 3      70    NA
##  7           2 Hosp 2      62    NA
##  8           2 Hosp 1      26    NA
##  9           3 Hosp 4      46  2002
## 10           3 Hosp 3      38    NA
## 11           3 Hosp 2      70    NA
## 12           3 Hosp 1       8    NA
df_combined <- 
  df_combined %>% 
  fill(Year, .direction = "down")

df_combined
## # A tibble: 12 x 4
##    Measurement Facility Cases  Year
##          <dbl> <chr>    <dbl> <dbl>
##  1           1 Hosp 4      82  2000
##  2           1 Hosp 3      47  2000
##  3           1 Hosp 2      71  2000
##  4           1 Hosp 1      66  2000
##  5           2 Hosp 4      87  2001
##  6           2 Hosp 3      70  2001
##  7           2 Hosp 2      62  2001
##  8           2 Hosp 1      26  2001
##  9           3 Hosp 4      46  2002
## 10           3 Hosp 3      38  2002
## 11           3 Hosp 2      70  2002
## 12           3 Hosp 1       8  2002
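
Both approaches above rely on the rows being arranged so that the known value sits at the start or end of each block. A less order-dependent variant, sketched below under the assumption that each Measurement corresponds to exactly one Year (as in this example), is to group the data first and fill in both directions. The object names toy and toy_filled are hypothetical and the data are a two-facility subset.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy subset of the combined data (two facilities), deliberately not sorted
toy <- tribble(
  ~Measurement, ~Facility, ~Cases,    ~Year,
             1,  "Hosp 1",     66, NA_real_,
             1,  "Hosp 4",     82,     2000,
             2,  "Hosp 1",     26, NA_real_,
             2,  "Hosp 4",     87,     2001,
             3,  "Hosp 4",     46,     2002,
             3,  "Hosp 1",      8, NA_real_
)

toy_filled <- toy %>% 
  group_by(Measurement) %>%                # fill only within each measurement
  fill(Year, .direction = "downup") %>%    # fill downwards first, then upwards
  ungroup()

toy_filled
```

Grouping prevents a value from one measurement leaking into the rows of another, which can happen with an ungrouped fill() if the row order is not what you expect.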

We now have a useful dataset for plotting:

ggplot(df_combined) +
  aes(Year, Cases, fill = Facility) +
  geom_col()

This is less useful for presenting in a table, so let’s practice converting this long data frame into a wider one:

df_combined %>% 
  pivot_wider(
    id_cols = Facility,        # one row per facility; Year and Cases are pivoted
    names_from = "Year",
    values_from = "Cases"
  ) %>% 
  arrange(Facility) %>% 
  janitor::adorn_totals(c("row", "col")) %>% 
  knitr::kable() %>% 
  kableExtra::row_spec(row = 5, bold = TRUE) %>% 
  kableExtra::column_spec(column = 5, bold = TRUE) 
Facility   2000   2001   2002   Total
Hosp 1       66     26      8     100
Hosp 2       71     62     70     203
Hosp 3       47     70     38     155
Hosp 4       82     87     46     215
Total       266    245    162     673

N.B. In this case, we had to specify id_cols = Facility so that each row is identified by the facility alone. Otherwise the additional variable Measurement would also be used to identify rows, and would interfere with the creation of the table:

df_combined %>% 
  pivot_wider(
    names_from = "Year",
    values_from = "Cases"
  ) %>% 
  knitr::kable()
Measurement   Facility   2000   2001   2002
          1   Hosp 4       82     NA     NA
          1   Hosp 3       47     NA     NA
          1   Hosp 2       71     NA     NA
          1   Hosp 1       66     NA     NA
          2   Hosp 4       NA     87     NA
          2   Hosp 3       NA     70     NA
          2   Hosp 2       NA     62     NA
          2   Hosp 1       NA     26     NA
          3   Hosp 4       NA     NA     46
          3   Hosp 3       NA     NA     38
          3   Hosp 2       NA     NA     70
          3   Hosp 1       NA     NA      8

12.5 Resources

Here is a helpful tutorial

13 Grouping data

This page covers how to group and aggregate data for descriptive analysis. It makes use of the tidyverse family of packages for common and easy-to-use functions.

Grouping data is a core component of data management and analysis. Grouped data can be statistically summarised by group, and can be plotted by group. Functions from the dplyr package (part of the tidyverse) make grouping and subsequent operations quite easy.

This page will address the following topics:

  • Group data with the group_by() function
  • Un-group data
  • summarise() grouped data with statistics
  • The difference between count() and tally()
  • arrange() applied to grouped data
  • filter() applied to grouped data
  • mutate() applied to grouped data
  • select() applied to grouped data
  • The base R aggregate() command as an alternative

13.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,       # to import data
  here,      # to locate files
  tidyverse, # to clean, handle, and plot the data (includes dplyr)
  janitor)   # adding total rows and columns

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

linelist <- import("linelist_cleaned.rds")

The first 50 rows of linelist:

13.2 Grouping

The function group_by() from dplyr groups the rows by the unique values in the column specified to it. If multiple columns are specified, rows are grouped by the unique combinations of values across the columns. Each unique value (or combination of values) constitutes a group. Subsequent changes to the dataset or calculations can then be performed within the context of each group.

For example, the command below takes the linelist and groups the rows by unique values in the column outcome, saving the output as a new data frame ll_by_outcome. The grouping column(s) are placed inside the parentheses of the function group_by().

ll_by_outcome <- linelist %>% 
  group_by(outcome)

Note that there is no perceptible change to the dataset after running group_by(), until another dplyr verb such as mutate(), summarise(), or arrange() is applied on the “grouped” data frame.

You can however “see” the groupings by printing the data frame. When you print a grouped data frame, you will see it has been transformed into a tibble class object which, when printed, displays which groupings have been applied and how many groups there are - written just above the header row.

# print to see which groups are active
ll_by_outcome
## # A tibble: 5,888 x 30
## # Groups:   outcome [3]
##    case_id generation date_infection date_onset date_hospitalisa~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5 hospital   lon   lat infector
##    <chr>        <dbl> <date>         <date>     <date>            <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>    <chr>    <dbl> <dbl> <chr>   
##  1 5fe599           4 2014-05-08     2014-05-13 2014-05-15        NA           <NA>    m          2 years            2 0-4     0-4      Other    -13.2  8.47 f547d6  
##  2 8689b7           4 NA             2014-05-13 2014-05-14        2014-05-18   Recover f          3 years            3 0-4     0-4      Missing  -13.2  8.45 <NA>    
##  3 11f8ea           2 NA             2014-05-16 2014-05-18        2014-05-30   Recover m         56 years           56 50-69   55-59    St. Mar~ -13.2  8.46 <NA>    
##  4 b8812a           3 2014-05-04     2014-05-18 2014-05-20        NA           <NA>    f         18 years           18 15-19   15-19    Port Ho~ -13.2  8.48 f90f5f  
##  5 893f25           3 2014-05-18     2014-05-21 2014-05-22        2014-05-29   Recover m          3 years            3 0-4     0-4      Militar~ -13.2  8.46 11f8ea  
##  6 be99c8           3 2014-05-03     2014-05-22 2014-05-23        2014-05-24   Recover f         16 years           16 15-19   15-19    Port Ho~ -13.2  8.46 aec8ec  
##  7 07e3e8           4 2014-05-22     2014-05-27 2014-05-29        2014-06-01   Recover f         16 years           16 15-19   15-19    Missing  -13.2  8.46 893f25  
##  8 369449           4 2014-05-28     2014-06-02 2014-06-03        2014-06-07   Death   f          0 years            0 0-4     0-4      Missing  -13.2  8.46 133ee7  
##  9 f393b4           4 NA             2014-06-05 2014-06-06        2014-06-18   Recover m         61 years           61 50-69   60-64    Missing  -13.2  8.46 <NA>    
## 10 1389ca           4 NA             2014-06-05 2014-06-07        2014-06-09   Death   f         27 years           27 20-29   25-29    Missing  -13.3  8.47 <NA>    
## # ... with 5,878 more rows, and 13 more variables: source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>, aches <chr>,
## #   vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>
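
Besides printing, the grouping status of a data frame can be inspected programmatically with dplyr helper functions. The sketch below uses a small made-up data frame (toy, a hypothetical name) rather than the linelist.

```r
library(dplyr)
library(tibble)

# Made-up data for illustration
toy <- tibble(
  outcome = c("Death", "Recover", NA, "Death"),
  age     = c(30, 12, 45, 7)
)

toy_grouped <- toy %>% group_by(outcome)

is_grouped_df(toy)           # FALSE: no grouping applied yet
is_grouped_df(toy_grouped)   # TRUE
group_vars(toy_grouped)      # "outcome"
n_groups(toy_grouped)        # 3: "Death", "Recover", and NA each form a group
```

Checks like these can be handy inside longer scripts, where it is easy to forget that an earlier group_by() is still active.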

Unique groups

The groups created reflect each unique combination of values across the grouping columns.

To see the groups and the number of rows in each group, pass the grouped data to tally(). To see just the unique groups without counts you can pass to group_keys().

See below that there are three unique values in the grouping column outcome: “Death”, “Recover”, and NA. There were 2582 deaths, 1983 recoveries, and 1323 cases with no outcome recorded.

linelist %>% 
  group_by(outcome) %>% 
  tally()
## # A tibble: 3 x 2
##   outcome     n
##   <chr>   <int>
## 1 Death    2582
## 2 Recover  1983
## 3 <NA>     1323

You can group by more than one column. Below, the data frame is grouped by outcome and gender, and then tallied. Note how each unique combination of outcome and gender is registered as its own group - including missing values for either column.

linelist %>% 
  group_by(outcome, gender) %>% 
  tally()
## # A tibble: 9 x 3
## # Groups:   outcome [3]
##   outcome gender     n
##   <chr>   <chr>  <int>
## 1 Death   f       1227
## 2 Death   m       1228
## 3 Death   <NA>     127
## 4 Recover f        953
## 5 Recover m        950
## 6 Recover <NA>      80
## 7 <NA>    f        627
## 8 <NA>    m        625
## 9 <NA>    <NA>      71
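
The group_keys() function mentioned above can be illustrated with a minimal sketch. The small data frame toy below is made up for illustration (it is not the handbook linelist); the result has one row per unique group and no counts.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
toy <- tibble(
  outcome = c("Death", "Recover", NA, "Death", "Recover"),
  gender  = c("f", "m", "f", "f", NA)
)

# One row per unique combination of outcome and gender, without counts
keys <- toy %>% 
  group_by(outcome, gender) %>% 
  group_keys()

keys
```

Note that the two "Death"/"f" rows collapse into a single group, and that missing values form groups of their own.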

New columns

You can also create a new grouping column within the group_by() statement. This is equivalent to calling mutate() before the group_by(). For a quick tabulation this style can be handy, but for more clarity in your code consider creating this column in its own mutate() step and then piping to group_by().

# group data based on a binary column created *within* the group_by() command
linelist %>% 
  group_by(
    age_class = ifelse(age >= 18, "adult", "child")) %>% 
  tally(sort = T)
## # A tibble: 3 x 2
##   age_class     n
##   <chr>     <int>
## 1 child      3618
## 2 adult      2184
## 3 <NA>         86

Add/drop grouping columns

By default, if you run group_by() on data that are already grouped, the old groups will be removed and the new one(s) will apply. If you want to add new groups to the existing ones, include the argument .add = TRUE.

# Grouped by outcome
by_outcome <- linelist %>% 
  group_by(outcome)

# Add grouping by gender in addition
by_outcome_gender <- by_outcome %>% 
  group_by(gender, .add = TRUE)

Keep all groups

If you group on a column of class factor there may be levels of the factor that are not currently present in the data. If you group on this column, by default those non-present levels are dropped and not included as groups. To change this so that all levels appear as groups (even if not present in the data), set .drop = FALSE in your group_by() command.
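
As a minimal sketch of this behaviour, the made-up data frame toy below has a factor column with a level ("10-14") that does not occur in any row. The names toy, dropped, and kept are illustrative only.

```r
library(dplyr)

# Toy data: the factor level "10-14" is defined but not present in the rows
toy <- tibble(
  age_cat = factor(c("0-4", "0-4", "5-9"),
                   levels = c("0-4", "5-9", "10-14"))
)

# Default: the empty level is dropped, so only two groups are tallied
dropped <- toy %>% 
  group_by(age_cat) %>% 
  tally()

# With .drop = FALSE the empty level is kept as a group with a count of 0
kept <- toy %>% 
  group_by(age_cat, .drop = FALSE) %>% 
  tally()

kept
```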

13.3 Un-group

Data that have been grouped will remain grouped until specifically ungrouped via ungroup(). If you forget to ungroup, it can lead to incorrect calculations! Below is an example of removing all groupings:

linelist %>% 
  group_by(outcome, gender) %>% 
  tally() %>% 
  ungroup()

You can also remove grouping for only specific columns, by placing the column name inside ungroup().

linelist %>% 
  group_by(outcome, gender) %>% 
  tally() %>% 
  ungroup(gender) # remove the grouping by gender, leave grouping by outcome

NOTE: The verb count() groups only temporarily: its result has the same grouping as its input, so count() applied to an ungrouped data frame returns ungrouped data.
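
If you are unsure whether an object is still grouped, group_vars() from dplyr returns the names of its current grouping columns. The sketch below uses a small made-up data frame (toy is a hypothetical name) to compare the result of group_by() plus tally() with the result of count().

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
toy <- tibble(
  outcome = c("Death", "Recover", "Death"),
  gender  = c("f", "m", "m")
)

# tally() after group_by() removes only the last grouping column
tallied <- toy %>% 
  group_by(outcome, gender) %>% 
  tally()

group_vars(tallied)   # still grouped by outcome

# count() on ungrouped data returns ungrouped data
counted <- toy %>% 
  count(outcome, gender)

group_vars(counted)   # no grouping columns
```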

13.4 Summarise

See the dplyr section of the Descriptive tables page for a detailed description of how to produce summary tables with summarise(). Here we briefly address how its behavior changes when applied to grouped data.

The dplyr function summarise() (or summarize()) takes a data frame and converts it into a new summary data frame, with columns containing summary statistics that you define. On an ungrouped data frame, the summary statistics will be calculated from all rows. Applying summarise() to grouped data produces those summary statistics for each group.

The syntax of summarise() is such that you provide the name(s) of the new summary column(s), an equals sign, and then a statistical function to apply to the data (for example min(), max(), median(), or sd()), as shown below. Within the statistical function, list the column to be operated on and any relevant argument (e.g. na.rm = TRUE). You can use sum() to count the number of rows that meet a logical criterion (with double equals ==).

Below is an example of summarise() applied without grouped data. The statistics returned are produced from the entire dataset.

# summary statistics on ungrouped linelist
linelist %>% 
  summarise(
    n_cases  = n(),
    mean_age = mean(age_years, na.rm=T),
    max_age  = max(age_years, na.rm=T),
    min_age  = min(age_years, na.rm=T),
    n_males  = sum(gender == "m", na.rm=T))
##   n_cases mean_age max_age min_age n_males
## 1    5888 16.01831      84       0    2803

In contrast, below is the same summarise() statement applied to grouped data. The statistics are calculated for each outcome group. Note how grouping columns will carry over into the new data frame.

# summary statistics on grouped linelist
linelist %>% 
  group_by(outcome) %>% 
  summarise(
    n_cases  = n(),
    mean_age = mean(age_years, na.rm=T),
    max_age  = max(age_years, na.rm=T),
    min_age  = min(age_years, na.rm=T),
    n_males  = sum(gender == "m", na.rm=T))
## # A tibble: 3 x 6
##   outcome n_cases mean_age max_age min_age n_males
##   <chr>     <int>    <dbl>   <dbl>   <dbl>   <int>
## 1 Death      2582     15.9      76       0    1228
## 2 Recover    1983     16.1      84       0     950
## 3 <NA>       1323     16.2      69       0     625

TIP: The summarise function works with both UK and US spelling - summarise() and summarize() call the same function.

13.5 Counts and tallies

count() and tally() provide similar functionality but are different. Read more about the distinction between tally() and count() here

tally()

tally() is shorthand for summarise(n = n()), and does not group data. Thus, to achieve grouped tallies it must follow a group_by() command. You can add sort = TRUE to see the largest groups first.

linelist %>% 
  tally()
##      n
## 1 5888
linelist %>% 
  group_by(outcome) %>% 
  tally(sort = TRUE)
## # A tibble: 3 x 2
##   outcome     n
##   <chr>   <int>
## 1 Death    2582
## 2 Recover  1983
## 3 <NA>     1323

count()

In contrast, count() does the following:

  1. applies group_by() on the specified column(s)
  2. applies summarise() and returns column n with the number of rows per group
  3. applies ungroup()

linelist %>% 
  count(outcome)
##   outcome    n
## 1   Death 2582
## 2 Recover 1983
## 3    <NA> 1323

Just like with group_by() you can create a new column within the count() command:

linelist %>% 
  count(age_class = ifelse(age >= 18, "adult", "child"), sort = T)
##   age_class    n
## 1     child 3618
## 2     adult 2184
## 3      <NA>   86

count() can be called multiple times, with the functionality “rolling up”. For example, to summarise the number of hospitals present for each gender, run the following. Note, the name of the final column is changed from default “n” for clarity (with name =).

linelist %>% 
  # produce counts by unique gender-hospital groups
  count(gender, hospital) %>% 
  # gather rows by gender (3) and count number of hospitals per gender (6)
  count(gender, name = "hospitals per gender" ) 
##   gender hospitals per gender
## 1      f                    6
## 2      m                    6
## 3   <NA>                    6

Add counts

In contrast to count() and summarise(), you can use add_count() to add a new column n with the counts of rows per group while retaining all the other data frame columns.

This means that a group’s count number, in the new column n, will be printed in each row of the group. For demonstration purposes, we add this column and then re-arrange the columns for easier viewing. See the section below on filter on group size for another example.

linelist %>% 
  as_tibble() %>%                   # convert to tibble for nicer printing 
  add_count(hospital) %>%           # add column n with counts by hospital
  select(hospital, n, everything()) # re-arrange for demo purposes
## # A tibble: 5,888 x 31
##    hospital         n case_id generation date_infection date_onset date_hospitalis~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5   lon   lat
##    <chr>        <int> <chr>        <dbl> <date>         <date>     <date>           <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>    <dbl> <dbl>
##  1 Other          885 5fe599           4 2014-05-08     2014-05-13 2014-05-15       NA           <NA>    m          2 years            2 0-4     0-4      -13.2  8.47
##  2 Missing       1469 8689b7           4 NA             2014-05-13 2014-05-14       2014-05-18   Recover f          3 years            3 0-4     0-4      -13.2  8.45
##  3 St. Mark's ~   422 11f8ea           2 NA             2014-05-16 2014-05-18       2014-05-30   Recover m         56 years           56 50-69   55-59    -13.2  8.46
##  4 Port Hospit~  1762 b8812a           3 2014-05-04     2014-05-18 2014-05-20       NA           <NA>    f         18 years           18 15-19   15-19    -13.2  8.48
##  5 Military Ho~   896 893f25           3 2014-05-18     2014-05-21 2014-05-22       2014-05-29   Recover m          3 years            3 0-4     0-4      -13.2  8.46
##  6 Port Hospit~  1762 be99c8           3 2014-05-03     2014-05-22 2014-05-23       2014-05-24   Recover f         16 years           16 15-19   15-19    -13.2  8.46
##  7 Missing       1469 07e3e8           4 2014-05-22     2014-05-27 2014-05-29       2014-06-01   Recover f         16 years           16 15-19   15-19    -13.2  8.46
##  8 Missing       1469 369449           4 2014-05-28     2014-06-02 2014-06-03       2014-06-07   Death   f          0 years            0 0-4     0-4      -13.2  8.46
##  9 Missing       1469 f393b4           4 NA             2014-06-05 2014-06-06       2014-06-18   Recover m         61 years           61 50-69   60-64    -13.2  8.46
## 10 Missing       1469 1389ca           4 NA             2014-06-05 2014-06-07       2014-06-09   Death   f         27 years           27 20-29   25-29    -13.3  8.47
## # ... with 5,878 more rows, and 14 more variables: infector <chr>, source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>,
## #   aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>

Add totals

To easily add total sum rows or columns after using tally() or count(), see the janitor section of the Descriptive tables page. This package offers functions like adorn_totals() and adorn_percentages() to add totals and convert to show percentages. Below is a brief example:

linelist %>%                                  # case linelist
  tabyl(age_cat, gender) %>%                  # cross-tabulate counts of two columns
  adorn_totals(where = "row") %>%             # add a total row
  adorn_percentages(denominator = "col") %>%  # convert to proportions with column denominator
  adorn_pct_formatting() %>%                  # convert proportions to percents
  adorn_ns(position = "front") %>%            # display as: "count (percent)"
  adorn_title(                                # adjust titles
    row_name = "Age Category",
    col_name = "Gender")
##                      Gender                           
##  Age Category             f             m          NA_
##           0-4  640  (22.8%)  416  (14.8%)  39  (14.0%)
##           5-9  641  (22.8%)  412  (14.7%)  42  (15.1%)
##         10-14  518  (18.5%)  383  (13.7%)  40  (14.4%)
##         15-19  359  (12.8%)  364  (13.0%)  20   (7.2%)
##         20-29  468  (16.7%)  575  (20.5%)  30  (10.8%)
##         30-49  179   (6.4%)  557  (19.9%)  18   (6.5%)
##         50-69    2   (0.1%)   91   (3.2%)   2   (0.7%)
##           70+    0   (0.0%)    5   (0.2%)   1   (0.4%)
##          <NA>    0   (0.0%)    0   (0.0%)  86  (30.9%)
##         Total 2807 (100.0%) 2803 (100.0%) 278 (100.0%)

To add more complex totals rows that involve summary statistics other than sums, see this section of the Descriptive Tables page.

13.6 Grouping by date

When grouping data by date, you must have (or create) a column for the date unit of interest - for example “day”, “epiweek”, “month”, etc. You can make this column using floor_date() from lubridate, as explained in the Epidemiological weeks section of the Working with dates page. Once you have this column, you can use count() from dplyr to group the rows by those unique date values and achieve aggregate counts.

One additional step, common when working with dates, is to “fill-in” any dates in the sequence that are not present in the data. Use complete() from tidyr so that the aggregated date series is complete, including all possible date units within the range. Without this step, a week with no cases reported might not appear in your data!

Within complete() you re-define your date column as a sequence of dates seq.Date() from the minimum to the maximum - thus the dates are expanded. By default, the case count values in any new “expanded” rows will be NA. You can set them to 0 using the fill = argument of complete(), which expects a named list (if your counts column is named n, provide fill = list(n = 0)). See ?complete for details and the Working with dates page for an example.

Linelist cases into days

Here is an example of grouping cases into days without using complete(). Note the first rows skip over dates with no cases.

daily_counts <- linelist %>% 
  drop_na(date_onset) %>%        # remove cases missing date_onset
  count(date_onset)              # count number of rows per unique date

Below we add the complete() command to ensure every day in the range is represented.

daily_counts <- linelist %>% 
  drop_na(date_onset) %>%                 # remove cases missing date_onset
  count(date_onset) %>%                   # count number of rows per unique date
  complete(                               # ensure all days appear even if no cases
    date_onset = seq.Date(                # re-define date column as daily sequence of dates
      from = min(date_onset, na.rm=T), 
      to = max(date_onset, na.rm=T),
      by = "day"),
    fill = list(n = 0))                   # set new filled-in rows to display 0 in column n (not NA as default) 
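
To see the effect of complete() on a scale small enough to inspect by eye, here is a minimal sketch using three made-up onset dates (toy and daily_toy are hypothetical names, not handbook objects). The two days without cases appear in the result with a count of 0.

```r
library(dplyr)
library(tidyr)

# Toy data: two cases on 1 May, none on 2-3 May, one case on 4 May
toy <- tibble(
  date_onset = as.Date(c("2014-05-01", "2014-05-01", "2014-05-04"))
)

daily_toy <- toy %>% 
  count(date_onset) %>%                 # rows only for dates that have cases
  complete(                             # add rows for the missing dates
    date_onset = seq.Date(
      from = min(date_onset),
      to   = max(date_onset),
      by   = "day"),
    fill = list(n = 0))                 # new rows get 0 instead of NA

daily_toy
```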

Linelist cases into weeks

The same principle can be applied for weeks. First create a new column that is the week of the case using floor_date() with unit = "week". Then, use count() as above to achieve weekly case counts. Finish with complete() to ensure that all weeks are represented, even if they contain no cases.

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  drop_na(date_onset) %>%                 # remove cases missing date_onset
  mutate(week = lubridate::floor_date(date_onset, unit = "week")) %>%  # new column of week of onset
  count(week) %>%                         # group data by week and count rows per group
  complete(                               # ensure all weeks appear even if no cases
    week = seq.Date(                      # re-define week column as weekly sequence of dates
      from = min(week, na.rm=T), 
      to = max(week, na.rm=T),
      by = "week"),
    fill = list(n = 0))                   # set new filled-in rows to display 0 in column n (not NA as default) 

Linelist cases into months

To aggregate cases into months, again use floor_date() from the lubridate package, but with the argument unit = "months". This rounds each date down to the 1st of its month. The output will be class Date. Note that in the complete() step we also use by = "month".

# Make dataset of monthly case counts
monthly_counts <- linelist %>% 
  drop_na(date_onset) %>% 
  mutate(month = lubridate::floor_date(date_onset, unit = "months")) %>%  # new column, 1st of month of onset
  count(month) %>%                          # count cases by month
  complete(
    month = seq.Date(
      min(month, na.rm=T),     # include all months with no cases reported
      max(month, na.rm=T),
      by="month"),
    fill = list(n = 0))

Daily counts into weeks

To aggregate daily counts into weekly counts, use floor_date() as above. However, use group_by() and summarize() instead of count() because you need to sum() daily case counts instead of just counting the number of rows per week.
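
A minimal sketch of this approach is below. It assumes a data frame of daily counts with columns date_onset and n; the small daily_toy data frame is made up for illustration.

```r
library(dplyr)
library(lubridate)

# Toy daily counts (hypothetical values): two days in one week, one in the next
daily_toy <- tibble(
  date_onset = as.Date(c("2014-05-05", "2014-05-06", "2014-05-12")),
  n          = c(2, 1, 4)
)

weekly_from_daily <- daily_toy %>% 
  mutate(week = floor_date(date_onset, unit = "week")) %>%  # week of each daily row
  group_by(week) %>%                                        # group daily rows by week
  summarise(n = sum(n, na.rm = TRUE))                       # sum the daily counts per week

weekly_from_daily
```

As in the earlier examples, you could finish with complete() to add any weeks that have no cases.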

Daily counts into months

To aggregate daily counts into monthly counts, use floor_date() with unit = "month" as above. However, use group_by() and summarize() instead of count() because you need to sum() daily case counts instead of just counting the number of rows per month.
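
The same pattern can be sketched for months. Again, the daily_toy data frame is made up for illustration and assumes daily counts stored in columns date_onset and n.

```r
library(dplyr)
library(lubridate)

# Toy daily counts (hypothetical values): two days in May, one in June
daily_toy <- tibble(
  date_onset = as.Date(c("2014-05-30", "2014-05-31", "2014-06-02")),
  n          = c(2, 1, 4)
)

monthly_from_daily <- daily_toy %>% 
  mutate(month = floor_date(date_onset, unit = "month")) %>%  # 1st of the month
  group_by(month) %>%                                         # group daily rows by month
  summarise(n = sum(n, na.rm = TRUE))                         # sum the daily counts per month

monthly_from_daily
```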

13.7 Arranging grouped data

The dplyr verb arrange() orders the rows of a data frame and, by default, ignores any grouping: it behaves the same on grouped and ungrouped data. If you set the argument .by_group = TRUE, the rows are ordered first by the grouping columns and then by any other columns you specify to arrange().
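
A minimal sketch of the difference, using a small made-up data frame (toy, ignore_groups, and within_groups are hypothetical names):

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
toy <- tibble(
  hospital = c("B", "A", "B", "A"),
  age      = c(30, 50, 10, 20)
)

# Default: the grouping is ignored and rows are ordered by age only
ignore_groups <- toy %>% 
  group_by(hospital) %>% 
  arrange(age)

# With .by_group = TRUE: rows are ordered by hospital first, then by age
within_groups <- toy %>% 
  group_by(hospital) %>% 
  arrange(age, .by_group = TRUE)

within_groups
```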

13.8 Filter on grouped data

filter()

When filter() is applied to grouped data in conjunction with functions that evaluate the data frame (like max(), min(), mean()), these functions are applied within each group. For example, if you want to filter and keep rows where patients are above the median age, this will now apply per group - filtering to keep rows above the group’s median age.
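
As a minimal sketch, the made-up data frame toy below contains two hospitals with different age distributions. The same filter() condition keeps different rows depending on whether the data are grouped.

```r
library(dplyr)

# Toy data (hypothetical values): hospital A is younger than hospital B
toy <- tibble(
  hospital = c("A", "A", "A", "B", "B", "B"),
  age      = c(10, 20, 30, 40, 50, 60)
)

# Ungrouped: compared against the overall median age (35)
above_overall_median <- toy %>% 
  filter(age > median(age))

# Grouped: compared against each hospital's own median age (20 and 50)
above_group_median <- toy %>% 
  group_by(hospital) %>% 
  filter(age > median(age)) %>% 
  ungroup()

above_group_median
```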

Slice rows per group

The dplyr function slice(), which filters rows based on their position in the data, can also be applied per group. Remember to account for sorting the data within each group to get the desired “slice”.

For example, to retrieve only the earliest 5 admissions from each hospital:

  1. Group the linelist by column hospital
  2. Arrange the records from earliest to latest date_hospitalisation within each hospital group (wrap the column in desc() if you want the latest admissions instead)
  3. Slice to retrieve the first 5 rows from each hospital

linelist %>%
  group_by(hospital) %>%
  arrange(hospital, date_hospitalisation) %>%
  slice_head(n = 5) %>% 
  arrange(hospital) %>%                            # for display
  select(case_id, hospital, date_hospitalisation)  # for display
## # A tibble: 30 x 3
## # Groups:   hospital [6]
##    case_id hospital          date_hospitalisation
##    <chr>   <chr>             <date>              
##  1 20b688  Central Hospital  2014-05-06          
##  2 d58402  Central Hospital  2014-05-10          
##  3 b8f2fd  Central Hospital  2014-05-13          
##  4 acf422  Central Hospital  2014-05-28          
##  5 275cc7  Central Hospital  2014-05-28          
##  6 d1fafd  Military Hospital 2014-04-17          
##  7 974bc1  Military Hospital 2014-05-13          
##  8 6a9004  Military Hospital 2014-05-13          
##  9 09e386  Military Hospital 2014-05-14          
## 10 865581  Military Hospital 2014-05-15          
## # ... with 20 more rows

slice_head() - selects n rows from the top
slice_tail() - selects n rows from the end
slice_sample() - randomly selects n rows
slice_min() - selects n rows with lowest values in order_by = column; ties are kept by default, set with_ties = FALSE to return exactly n rows
slice_max() - selects n rows with highest values in order_by = column; ties are kept by default, set with_ties = FALSE to return exactly n rows

See the De-duplication page for more examples and detail on slice().
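
As a minimal sketch of slicing by value rather than by position, slice_max() can retrieve the most recent admissions per group without a separate arrange() step. The data frame toy is made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
toy <- tibble(
  hospital = c("A", "A", "A", "B", "B"),
  date_hospitalisation = as.Date(c("2014-05-01", "2014-05-03", "2014-05-02",
                                   "2014-06-01", "2014-06-05"))
)

# Keep the 2 most recent admissions per hospital;
# with_ties = FALSE guarantees at most 2 rows per group
latest_two <- toy %>% 
  group_by(hospital) %>% 
  slice_max(order_by = date_hospitalisation, n = 2, with_ties = FALSE) %>% 
  ungroup()

latest_two
```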

Filter on group size

The function add_count() adds a column n to the original data giving the number of rows in that row’s group.

Shown below, add_count() is applied to the column hospital, so the values in the new column n reflect the number of rows in that row’s hospital group. Note how values in column n are repeated. In the example below, the column name n could be changed using name = within add_count(). For demonstration purposes we re-arrange the columns with select().

linelist %>% 
  as_tibble() %>% 
  add_count(hospital) %>%          # add "number of rows admitted to same hospital as this row" 
  select(hospital, n, everything())
## # A tibble: 5,888 x 31
##    hospital         n case_id generation date_infection date_onset date_hospitalis~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5   lon   lat
##    <chr>        <int> <chr>        <dbl> <date>         <date>     <date>           <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>    <dbl> <dbl>
##  1 Other          885 5fe599           4 2014-05-08     2014-05-13 2014-05-15       NA           <NA>    m          2 years            2 0-4     0-4      -13.2  8.47
##  2 Missing       1469 8689b7           4 NA             2014-05-13 2014-05-14       2014-05-18   Recover f          3 years            3 0-4     0-4      -13.2  8.45
##  3 St. Mark's ~   422 11f8ea           2 NA             2014-05-16 2014-05-18       2014-05-30   Recover m         56 years           56 50-69   55-59    -13.2  8.46
##  4 Port Hospit~  1762 b8812a           3 2014-05-04     2014-05-18 2014-05-20       NA           <NA>    f         18 years           18 15-19   15-19    -13.2  8.48
##  5 Military Ho~   896 893f25           3 2014-05-18     2014-05-21 2014-05-22       2014-05-29   Recover m          3 years            3 0-4     0-4      -13.2  8.46
##  6 Port Hospit~  1762 be99c8           3 2014-05-03     2014-05-22 2014-05-23       2014-05-24   Recover f         16 years           16 15-19   15-19    -13.2  8.46
##  7 Missing       1469 07e3e8           4 2014-05-22     2014-05-27 2014-05-29       2014-06-01   Recover f         16 years           16 15-19   15-19    -13.2  8.46
##  8 Missing       1469 369449           4 2014-05-28     2014-06-02 2014-06-03       2014-06-07   Death   f          0 years            0 0-4     0-4      -13.2  8.46
##  9 Missing       1469 f393b4           4 NA             2014-06-05 2014-06-06       2014-06-18   Recover m         61 years           61 50-69   60-64    -13.2  8.46
## 10 Missing       1469 1389ca           4 NA             2014-06-05 2014-06-07       2014-06-09   Death   f         27 years           27 20-29   25-29    -13.3  8.47
## # ... with 5,878 more rows, and 14 more variables: infector <chr>, source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>,
## #   aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>

It then becomes easy to filter for case rows who were hospitalized at a “small” hospital, say, a hospital that admitted fewer than 500 patients:

linelist %>% 
  add_count(hospital) %>% 
  filter(n < 500)

13.9 Mutate on grouped data

To retain all columns and rows (not summarise) and add a new column containing group statistics, use mutate() after group_by() instead of summarise().

This is useful if you want group statistics in the original dataset with all other columns present - e.g. for calculations that compare one row to its group.

For example, the code below calculates the difference between a row’s delay-to-admission and the mean delay for their hospital. The steps are:

  1. Group the data by hospital
  2. Use the column days_onset_hosp (delay to hospitalisation) to create a new column containing the mean delay at the hospital of that row
  3. Calculate the difference between the two columns

We select() only certain columns to display, for demonstration purposes.

linelist %>% 
  # group data by hospital (no change to linelist yet)
  group_by(hospital) %>% 
  
  # new columns
  mutate(
    # mean days to admission per hospital (rounded to 1 decimal)
    group_delay_admit = round(mean(days_onset_hosp, na.rm=T), 1),
    
    # difference between row's delay and mean delay at their hospital (rounded to 1 decimal)
    diff_to_group     = round(days_onset_hosp - group_delay_admit, 1)) %>%
  
  # select certain columns only - for demonstration/viewing purposes
  select(case_id, hospital, days_onset_hosp, group_delay_admit, diff_to_group)
## # A tibble: 5,888 x 5
## # Groups:   hospital [6]
##    case_id hospital                             days_onset_hosp group_delay_admit diff_to_group
##    <chr>   <chr>                                          <dbl>             <dbl>         <dbl>
##  1 5fe599  Other                                              2               2             0  
##  2 8689b7  Missing                                            1               2.1          -1.1
##  3 11f8ea  St. Mark's Maternity Hospital (SMMH)               2               2.1          -0.1
##  4 b8812a  Port Hospital                                      2               2.1          -0.1
##  5 893f25  Military Hospital                                  1               2.1          -1.1
##  6 be99c8  Port Hospital                                      1               2.1          -1.1
##  7 07e3e8  Missing                                            2               2.1          -0.1
##  8 369449  Missing                                            1               2.1          -1.1
##  9 f393b4  Missing                                            1               2.1          -1.1
## 10 1389ca  Missing                                            2               2.1          -0.1
## # ... with 5,878 more rows

13.10 Select on grouped data

The verb select() works on grouped data, but the grouping columns are always included (even if not mentioned in select()). If you do not want these grouping columns, use ungroup() first.
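
A minimal sketch of this behaviour, using a small made-up data frame (toy is a hypothetical name):

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
toy <- tibble(
  case_id  = c("a", "b"),
  hospital = c("A", "B"),
  age      = c(10, 20)
)

# Grouped: hospital is retained even though only case_id was selected
grouped_sel <- toy %>% 
  group_by(hospital) %>% 
  select(case_id)

# Ungrouped first: only case_id is returned
ungrouped_sel <- toy %>% 
  group_by(hospital) %>% 
  ungroup() %>% 
  select(case_id)

names(grouped_sel)
names(ungrouped_sel)
```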

13.11 Resources

Here are some useful resources for more information:

You can perform any summary function on grouped data; see the RStudio data transformation cheat sheet

The Data Carpentry page on dplyr
The tidyverse reference pages on group_by() and grouping

This page on Data manipulation

Summarize with conditions in dplyr

14 Joining data

Above: an animated example of a left join (image source)

This page describes ways to “join”, “match”, “link”, “bind”, and otherwise combine data frames.

Epidemiological analyses and workflows usually involve multiple sources of data and the linkage of multiple datasets. Perhaps you need to connect laboratory data to patient clinical outcomes, or Google mobility data to infectious disease trends, or even a dataset at one stage of analysis to a transformed version of itself.

In this page we demonstrate code to:

  • Conduct joins of two data frames such that rows are matched based on common values in identifier columns
  • Join two data frames based on probabilistic (likely) matches between values
  • Expand a data frame by directly binding (or “appending”) rows or columns from another data frame

14.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,            # import and export
  here,           # locate files 
  tidyverse,      # data management and visualisation
  RecordLinkage,  # probabilistic matches
  fastLink        # probabilistic matches
)

Import data

To begin, we import the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist <- import("linelist_cleaned.rds")

Example datasets

In the joining section below, we will use the following datasets:

  1. A “miniature” version of the case linelist, containing only the columns case_id, date_onset, and hospital, and only the first 10 rows
  2. A separate data frame named hosp_info, which contains more details about each hospital

In the section on probabilistic matching, we will use two different small datasets. The code to create those datasets is given in that section.

“Miniature” case linelist

Below is the miniature case linelist, which contains only 10 rows and only columns case_id, date_onset, and hospital.

linelist_mini <- linelist %>%                 # start with original linelist
  select(case_id, date_onset, hospital) %>%   # select columns
  head(10)                                    # only take the first 10 rows

Hospital information data frame

Below is the code to create a separate data frame with additional information about seven hospitals (the catchment population, and the level of care available). Note that the name “Military Hospital” belongs to two different hospitals - one a primary level serving 10000 residents and the other a secondary level serving 40500 residents.

# Make the hospital information data frame
hosp_info = data.frame(
  hosp_name     = c("central hospital", "military", "military", "port", "St. Mark's", "ignace", "sisters"),
  catchment_pop = c(1950280, 40500, 10000, 50280, 12000, 5000, 4200),
  level         = c("Tertiary", "Secondary", "Primary", "Secondary", "Secondary", "Primary", "Primary")
)

Pre-cleaning

Traditional joins (non-probabilistic) are case-sensitive and require exact character matches between values in the two data frames. To demonstrate some of the cleaning steps you might need to do before initiating a join, we will clean and align the linelist_mini and hosp_info datasets now.

Identify differences

We need the values of the hosp_name column in the hosp_info data frame to match the values of the hospital column in the linelist_mini data frame.

Here are the values in the linelist_mini data frame, printed with the base R function unique():

unique(linelist_mini$hospital)
## [1] "Other"                                "Missing"                              "St. Mark's Maternity Hospital (SMMH)" "Port Hospital"                       
## [5] "Military Hospital"

and here are the values in the hosp_info data frame:

unique(hosp_info$hosp_name)
## [1] "central hospital" "military"         "port"             "St. Mark's"       "ignace"           "sisters"

You can see that while some of the hospitals exist in both data frames, there are many differences in spelling.

Align values

We begin by cleaning the values in the hosp_info data frame. As explained in the Cleaning data and core functions page, we can re-code values with logical criteria using dplyr’s case_when() function. For the four hospitals that also appear in the linelist, we change the values to align with the values used there. For the other hospitals, we leave the values as they are (TRUE ~ hosp_name).

CAUTION: Typically when cleaning one should create a new column (e.g. hosp_name_clean), but for ease of demonstration we show modification of the old column

hosp_info <- hosp_info %>% 
  mutate(
    hosp_name = case_when(
      # criteria                         # new value
      hosp_name == "military"          ~ "Military Hospital",
      hosp_name == "port"              ~ "Port Hospital",
      hosp_name == "St. Mark's"        ~ "St. Mark's Maternity Hospital (SMMH)",
      hosp_name == "central hospital"  ~ "Central Hospital",
      TRUE                             ~ hosp_name
      )
    )

The hospital names that appear in both data frames are now aligned. Some hospitals in hosp_info are not present in linelist_mini - we will deal with these later, in the join.

unique(hosp_info$hosp_name)
## [1] "Central Hospital"                     "Military Hospital"                    "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)"
## [5] "ignace"                               "sisters"

Prior to a join, it is often easiest to convert a column to all lowercase or all uppercase. If you need to convert all values in a column to UPPER or lower case, use mutate() and wrap the column with one of these functions from stringr, as shown in the page on Characters and strings.

str_to_upper()
str_to_lower()
str_to_title()
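
As a minimal sketch of this kind of case conversion, the made-up data frame toy below stores hospital names in inconsistent case. The converted values are written to new columns (hypothetical names hosp_name_lower and hosp_name_title) so that the original column is preserved.

```r
library(dplyr)
library(stringr)

# Toy data with inconsistent capitalisation (hypothetical values)
toy <- tibble(hosp_name = c("central hospital", "MILITARY", "Port"))

toy_clean <- toy %>% 
  mutate(
    hosp_name_lower = str_to_lower(hosp_name),   # all lowercase
    hosp_name_title = str_to_title(hosp_name))   # First Letter Of Each Word

toy_clean
```

Applying the same conversion to the identifier columns of both data frames before a join helps avoid mismatches that are due only to capitalisation.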

14.2 dplyr joins

The dplyr package offers several different join functions. dplyr is included in the tidyverse package. These join functions are described below, with simple use cases.

Many thanks to https://github.com/gadenbuie for the informative gifs!

General syntax

The join commands can be run as standalone commands to join two data frames into a new object, or they can be used within a pipe chain (%>%) to merge one data frame into another as it is being cleaned or otherwise modified.

In the example below, the function left_join() is used as a standalone command to create a new joined_data data frame. The inputs are data frames 1 and 2 (df1 and df2). The first data frame listed is the baseline data frame, and the second one listed is joined to it.

The third argument by = is where you specify the columns in each data frame that will be used to align the rows in the two data frames. If the names of these columns are different, provide them within a c() vector as shown below, where the rows are matched on the basis of common values between the column ID in df1 and the column identifier in df2.

# Join based on common values between column "ID" (first data frame) and column "identifier" (second data frame)
joined_data <- left_join(df1, df2, by = c("ID" = "identifier"))

If the by columns in both data frames have the exact same name, you can just provide this one name, within quotes.

# Join based on common values in column "ID" in both data frames
joined_data <- left_join(df1, df2, by = "ID")

If you are joining the data frames based on common values across multiple fields, list these fields within the c() vector. This example joins rows if the values in three columns in each dataset align exactly.

# join based on same first name, last name, and age
joined_data <- left_join(df1, df2, by = c("name" = "firstname", "surname" = "lastname", "Age" = "age"))

The join commands can also be run within a pipe chain. This will modify the data frame being piped.

In the example below, df1 is passed through the pipes, df2 is joined to it, and df1 is thus modified and re-defined.

df1 <- df1 %>%
  filter(date_onset < as.Date("2020-03-05")) %>% # miscellaneous cleaning 
  left_join(df2, by = c("ID" = "identifier"))    # join df2 to df1

CAUTION: Joins are case-sensitive! Therefore it is useful to convert all values to lowercase or uppercase prior to joining. See the page on characters/strings.
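To illustrate the caution above, here is a minimal sketch using two small made-up data frames. The names patients and sites are hypothetical and not part of the handbook data, and the sketch assumes the dplyr and stringr packages are installed.

```r
library(dplyr)
library(stringr)

# Hypothetical data: the hospital names differ only by letter case
patients <- tibble(
  id       = 1:3,
  hospital = c("Port Hospital", "port hospital", "PORT HOSPITAL"))

sites <- tibble(hospital = "Port Hospital", level = "Secondary")

# Join as-is: only the value with identical case matches, the others get NA in level
joined_raw <- left_join(patients, sites, by = "hospital")

# Convert the join column to lowercase in both data frames, then join: all rows match
joined_clean <- patients %>%
  mutate(hospital = str_to_lower(hospital)) %>%
  left_join(
    sites %>% mutate(hospital = str_to_lower(hospital)),
    by = "hospital")

sum(is.na(joined_raw$level))     # 2 rows failed to match
sum(is.na(joined_clean$level))   # 0 rows failed to match
```

If the original capitalisation is needed for display, one option is to create the lowercase values in a new column and join on that column instead.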

Left and right joins

A left or right join is commonly used to add information to a data frame - new information is added only to rows that already existed in the baseline data frame. These are common joins in epidemiological work as they are used to add information from one dataset into another.

In using these joins, the written order of the data frames in the command is important.

  • In a left join, the first data frame written is the baseline
  • In a right join, the second data frame written is the baseline

All rows of the baseline data frame are kept. Information in the other (secondary) data frame is joined to the baseline data frame only if there is a match via the identifier column(s). In addition:

  • Rows in the secondary data frame that do not match are dropped.
  • If there are many baseline rows that match to one row in the secondary data frame (many-to-one), the secondary information is added to each matching baseline row.
  • If a baseline row matches to multiple rows in the secondary data frame (one-to-many), all combinations are given, meaning new rows may be added to your returned data frame!
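The one-to-many behaviour described in the list above can be sketched with two small made-up data frames (baseline and secondary are hypothetical names, and the values are invented for illustration). Comparing nrow() before and after the join is a quick way to notice that rows were added.

```r
library(dplyr)

# Hypothetical baseline: two cases, one row each
baseline <- tibble(
  case_id  = c("c1", "c2"),
  hospital = c("A", "B"))

# Hypothetical secondary: hospital "A" is listed twice, "C" has no cases, "B" is absent
secondary <- tibble(
  hospital = c("A", "A", "C"),
  level    = c("Primary", "Secondary", "Primary"))

joined <- left_join(baseline, secondary, by = "hospital")

nrow(baseline)   # 2
nrow(joined)     # 3: case c1 appears twice, case c2 once with NA in level
```

The unmatched secondary row (hospital "C") is dropped, and the baseline row for case c1 is returned once per matching secondary row.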

Animated examples of left and right joins (image source)

Example

Below is the output of a left_join() of hosp_info (secondary data frame, view here) into linelist_mini (baseline data frame, view here). The original linelist_mini has nrow(linelist_mini) rows. The modified linelist_mini is displayed. Note the following:

  • Two new columns, catchment_pop and level, have been added on the right side of linelist_mini
  • All original rows of the baseline data frame linelist_mini are kept
  • Any original rows of linelist_mini for “Military Hospital” are duplicated because it matched to two rows in the secondary data frame, so both combinations are returned
  • The join identifier column of the secondary dataset (hosp_name) has disappeared because it is redundant with the identifier column in the primary dataset (hospital)
  • When a baseline row did not match to any secondary row (e.g. when hospital is “Other” or “Missing”), NA (blank) fills in the columns from the secondary data frame
  • Rows in the secondary data frame with no match to the baseline data frame (“sisters” and “ignace” hospitals) were dropped
linelist_mini %>% 
  left_join(hosp_info, by = c("hospital" = "hosp_name"))

“Should I use a right join, or a left join?”

To answer the above question, ask yourself “which data frame should retain all of its rows?” - use this one as the baseline. A left join keeps all the rows in the first data frame written in the command, whereas a right join keeps all the rows in the second data frame.

The two commands below achieve the same output - 10 rows of hosp_info joined into a linelist_mini baseline, but they use different joins. The result is that the column order will differ based on whether hosp_info arrives from the right (in the left join) or arrives from the left (in the right join). The order of the rows may also shift accordingly. But both of these consequences can be subsequently addressed, using select() to re-order columns or arrange() to sort rows.

# The two commands below achieve the same data, but with differently ordered rows and columns
left_join(linelist_mini, hosp_info, by = c("hospital" = "hosp_name"))
right_join(hosp_info, linelist_mini, by = c("hosp_name" = "hospital"))
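As a rough check of the claim that the two joins differ only in column and row order, the sketch below uses small hypothetical stand-ins for linelist_mini and hosp_info (cases_small and hosp_small are made-up names and values). After renaming the identifier column, re-ordering columns with select() and sorting rows with arrange(), the two results compare as equal.

```r
library(dplyr)

# Hypothetical stand-ins for linelist_mini and hosp_info
cases_small <- tibble(
  case_id  = c("c1", "c2", "c3"),
  hospital = c("Port", "Other", "Central"))

hosp_small <- tibble(
  hosp_name = c("Central", "Port"),
  level     = c("Tertiary", "Secondary"))

via_left  <- left_join(cases_small, hosp_small, by = c("hospital" = "hosp_name"))
via_right <- right_join(hosp_small, cases_small, by = c("hosp_name" = "hospital"))

# Harmonise both results: same identifier name, same column order, same row order
via_left_tidy <- via_left %>%
  arrange(case_id)

via_right_tidy <- via_right %>%
  rename(hospital = hosp_name) %>%
  select(case_id, hospital, level) %>%
  arrange(case_id)

all.equal(as.data.frame(via_left_tidy), as.data.frame(via_right_tidy))
```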

Here is the result of hosp_info into linelist_mini via a left join (new columns incoming from the right)

Here is the result of hosp_info into linelist_mini via a right join (new columns incoming from the left)

Also consider whether your use-case is within a pipe chain (%>%). If the dataset in the pipes is the baseline, you will likely use a left join to add data to it.

Full join

A full join is the most inclusive of the joins - it returns all rows from both data frames.

If there are any rows present in one and not the other (where no match was found), the data frame will include them and become longer. NA missing values are used to fill-in any gaps created. As you join, watch the number of columns and rows carefully to troubleshoot case-sensitivity and exact character matches.

The “baseline” data frame is the one written first in the command. Adjustment of this will not impact which records are returned by the join, but it can impact the resulting column order, row order, and which identifier columns are retained.

Animated example of a full join (image source)

Example

Below is the output of a full_join() of hosp_info (originally nrow(hosp_info), view here) into linelist_mini (originally nrow(linelist_mini), view here). Note the following:

  • All baseline rows are kept (linelist_mini)
  • Rows in the secondary that do not match to the baseline are kept (“ignace” and “sisters”), with values in the corresponding baseline columns case_id and onset filled in with missing values
  • Likewise, rows in the baseline data frame that do not match to the secondary (“Other” and “Missing”) are kept, with secondary columns catchment_pop and level filled-in with missing values
  • In the case of one-to-many or many-to-one matches (e.g. rows for “Military Hospital”), all possible combinations are returned (lengthening the final data frame)
  • Only the identifier column from the baseline is kept (hospital)
linelist_mini %>% 
  full_join(hosp_info, by = c("hospital" = "hosp_name"))

Inner join

An inner join is the most restrictive of the joins - it returns only rows with matches across both data frames.
This means that the number of rows in the baseline data frame may actually reduce. Adjustment of which data frame is the “baseline” (written first in the function) will not impact which rows are returned, but it will impact the column order, row order, and which identifier columns are retained.

Animated example of an inner join (image source)

Example

Below is the output of an inner_join() of linelist_mini (baseline) with hosp_info (secondary). Note the following:

  • Baseline rows with no match to the secondary data are removed (rows where hospital is “Missing” or “Other”)
  • Likewise, rows from the secondary data frame that had no match in the baseline are removed (rows where hosp_name is “sisters” or “ignace”)
  • Only the identifier column from the baseline is kept (hospital)
linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))

Semi join

A semi join is a “filtering join” which uses another dataset not to add rows or columns, but to perform filtering.

A semi-join keeps all observations in the baseline data frame that have a match in the secondary data frame (but does not add new columns nor duplicate any rows for multiple matches). Read more about these “filtering” joins here.

Animated example of a semi join (image source)

As an example, the below code returns rows from the hosp_info data frame that have matches in linelist_mini based on hospital name.

hosp_info %>% 
  semi_join(linelist_mini, by = c("hosp_name" = "hospital"))
##                              hosp_name catchment_pop     level
## 1                    Military Hospital         40500 Secondary
## 2                    Military Hospital         10000   Primary
## 3                        Port Hospital         50280 Secondary
## 4 St. Mark's Maternity Hospital (SMMH)         12000 Secondary
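To make the difference from an inner join concrete, here is a small sketch with made-up data (cases_small and hosp_small are hypothetical). Hospital "A" is listed twice in the secondary data frame, so an inner join returns the matching case twice, whereas a semi join returns it once and adds no columns.

```r
library(dplyr)

# Hypothetical data: hospital "A" appears twice in the secondary data frame
cases_small <- tibble(
  case_id  = c("c1", "c2"),
  hospital = c("A", "Other"))

hosp_small <- tibble(
  hosp_name = c("A", "A"),
  level     = c("Primary", "Secondary"))

# inner_join(): adds the level column and returns case c1 once per match
inner_result <- inner_join(cases_small, hosp_small, by = c("hospital" = "hosp_name"))

# semi_join(): filters only, so c1 is returned once and the columns are unchanged
semi_result <- semi_join(cases_small, hosp_small, by = c("hospital" = "hosp_name"))

nrow(inner_result)   # 2
nrow(semi_result)    # 1
```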

Anti join

The anti join is another “filtering join” that returns rows in the baseline data frame that do not have a match in the secondary data frame.

Read more about filtering joins here.

Common scenarios for an anti-join include identifying records not present in another data frame, troubleshooting spelling in a join (reviewing records that should have matched), and examining records that were excluded after another join.

As with right_join() and left_join(), the baseline data frame (listed first) is important. The returned rows are from the baseline data frame only. Notice in the gif below that a row in the secondary data frame (purple row 4) is not returned even though it does not match with the baseline.

Animated example of an anti join (image source)

Simple anti_join() example

For a simple example, let’s find the hosp_info hospitals that do not have any cases present in linelist_mini. We list hosp_info first, as the baseline data frame. The hospitals which are not present in linelist_mini are returned.

hosp_info %>% 
  anti_join(linelist_mini, by = c("hosp_name" = "hospital"))

Complex anti_join() example

For another example, let us say we ran an inner_join() between linelist_mini and hosp_info. This returns only a subset of the original linelist_mini records, as some are not present in hosp_info.

linelist_mini %>% 
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))

To review the linelist_mini records that were excluded during the inner join, we can run an anti-join with the same settings (linelist_mini as the baseline).

linelist_mini %>% 
  anti_join(hosp_info, by = c("hospital" = "hosp_name"))

To see the hosp_info records that were excluded in the inner join, we could also run an anti-join with hosp_info as the baseline data frame.
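The sketch below puts the inner join and both anti-joins side by side, using small hypothetical stand-ins for linelist_mini and hosp_info (cases_small and hosp_small, with invented values). It assumes hospital names are unique in the secondary data frame, in which case every baseline row is either kept by the inner join or returned by the anti-join.

```r
library(dplyr)

# Hypothetical stand-ins for linelist_mini and hosp_info
cases_small <- tibble(
  case_id  = c("c1", "c2", "c3"),
  hospital = c("Port", "Other", "Central"))

hosp_small <- tibble(
  hosp_name = c("Central", "Port", "sisters"),
  level     = c("Tertiary", "Secondary", "Primary"))

# Rows kept by the inner join
kept <- inner_join(cases_small, hosp_small, by = c("hospital" = "hosp_name"))

# Baseline rows excluded by the inner join
dropped_cases <- anti_join(cases_small, hosp_small, by = c("hospital" = "hosp_name"))

# Secondary rows excluded by the inner join: swap the data frames and the by = names
dropped_hosp <- anti_join(hosp_small, cases_small, by = c("hosp_name" = "hospital"))

# Accounting check (holds here because hosp_name is unique in hosp_small)
nrow(kept) + nrow(dropped_cases) == nrow(cases_small)
```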

14.3 Probabilistic matching

If you do not have a unique identifier common across datasets to join on, consider using a probabilistic matching algorithm. This would find matches between records based on similarity (e.g. Jaro–Winkler string distance, or numeric distance). Below is a simple example using the package fastLink.

Load packages

pacman::p_load(
  tidyverse,      # data manipulation and visualization
  fastLink        # record matching
  )

Here are two small example datasets that we will use to demonstrate the probabilistic matching (cases and results):

Here is the code used to make the datasets:

# make datasets

cases <- tribble(
  ~gender, ~first,      ~middle,     ~last,        ~yr,   ~mon, ~day, ~district,
  "M",     "Amir",      NA,          "Khan",       1989,  11,   22,   "River",
  "M",     "Anthony",   "B.",        "Smith",      1970, 09, 19,      "River", 
  "F",     "Marialisa", "Contreras", "Rodrigues",  1972, 04, 15,      "River",
  "F",     "Elizabeth", "Casteel",   "Chase",      1954, 03, 03,      "City",
  "M",     "Jose",      "Sanchez",   "Lopez",      1996, 01, 06,      "City",
  "F",     "Cassidy",   "Jones",      "Davis",     1980, 07, 20,      "City",
  "M",     "Michael",   "Murphy",     "O'Calaghan",1969, 04, 12,      "Rural", 
  "M",     "Oliver",    "Laurent",    "De Bordow" , 1971, 02, 04,     "River",
  "F",      "Blessing",  NA,          "Adebayo",   1955,  02, 14,     "Rural"
)

results <- tribble(
  ~gender,  ~first,     ~middle,     ~last,          ~yr, ~mon, ~day, ~district, ~result,
  "M",      "Amir",     NA,          "Khan",         1989, 11,   22,  "River", "positive",
  "M",      "Tony",   "B",         "Smith",          1970, 09,   19,  "River", "positive",
  "F",      "Maria",    "Contreras", "Rodriguez",    1972, 04,   15,  "Cty",   "negative",
  "F",      "Betty",    "Castel",   "Chase",        1954,  03,   30,  "City",  "positive",
  "F",      "Andrea",   NA,          "Kumaraswamy",  2001, 01,   05,  "Rural", "positive",      
  "F",      "Caroline", NA,          "Wang",         1988, 12,   11,  "Rural", "negative",
  "F",      "Trang",    NA,          "Nguyen",       1981, 06,   10,  "Rural", "positive",
  "M",      "Olivier" , "Laurent",   "De Bordeaux",  NA,   NA,   NA,  "River", "positive",
  "M",      "Mike",     "Murphy",    "O'Callaghan",  1969, 04,   12,  "Rural", "negative",
  "F",      "Cassidy",  "Jones",     "Davis",        1980, 07,   02,  "City",  "positive",
  "M",      "Mohammad", NA,          "Ali",          1942, 01,   17,  "City",  "negative",
  NA,       "Jose",     "Sanchez",   "Lopez",        1995, 01,   06,  "City",  "negative",
  "M",      "Abubakar", NA,          "Abullahi",     1960, 01,   01,  "River", "positive",
  "F",      "Maria",    "Salinas",   "Contreras",    1955, 03,   03,  "River", "positive"
  )

The cases dataset has 9 records of patients who are awaiting test results.

The results dataset has 14 records and contains the column result, which we want to add to the records in cases based on probabilistic matching of records.

Probabilistic matching

The fastLink() function from the fastLink package can be used to apply a matching algorithm. Here is the basic information. You can read more detail by entering ?fastLink in your console.

  • Define the two data frames for comparison to arguments dfA = and dfB =
  • In varnames = give all column names to be used for matching. They must all exist in both dfA and dfB.
  • In stringdist.match = give columns from those in varnames to be evaluated on string “distance”.
  • In numeric.match = give columns from those in varnames to be evaluated on numeric distance.
  • Missing values are ignored
  • By default, each row in either data frame is matched to at most one row in the other data frame. If you want to see all the evaluated matches, set dedupe.matches = FALSE. The deduplication is done using Winkler’s linear assignment solution.

Tip: split one date column into three separate numeric columns using day(), month(), and year() from lubridate package
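A minimal sketch of the tip above, assuming a hypothetical data frame people with a single date-of-birth column dob (both names are invented for illustration). The new column names yr, mon, and day mirror those used in the example datasets.

```r
library(dplyr)
library(lubridate)

# Hypothetical data frame with one date column
people <- tibble(
  first = c("Amir", "Anthony"),
  dob   = as.Date(c("1989-11-22", "1970-09-19")))

people_split <- people %>%
  mutate(
    yr  = year(dob),    # year as a number
    mon = month(dob),   # month as a number (1 to 12)
    day = day(dob))     # day of the month as a number

people_split
```

The three numeric columns can then be listed in varnames = and numeric.match = when running fastLink().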

The default threshold for matches is 0.85 (threshold.match =), but you can adjust it higher or lower. If you define the threshold, consider that a higher threshold could yield more false negatives (rows that should match but do not), and likewise a lower threshold could yield more false-positive matches.

Below, the data are matched on string distance across the name and district columns, and on numeric distance for year, month, and day of birth. A match threshold of 95% probability is set.

fl_output <- fastLink::fastLink(
  dfA = cases,
  dfB = results,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district"),
  stringdist.match = c("first", "middle", "last", "district"),
  numeric.match = c("yr", "mon", "day"),
  threshold.match = 0.95)
## 
## ==================== 
## fastLink(): Fast Probabilistic Record Linkage
## ==================== 
## 
## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
## Calculating matches for each variable.
## Getting counts for parameter estimation.
##     Parallelizing calculation using OpenMP. 1 threads out of 12 are used.
## Running the EM algorithm.
## Getting the indices of estimated matches.
##     Parallelizing calculation using OpenMP. 1 threads out of 12 are used.
## Deduping the estimated matches.
## Getting the match patterns for each estimated match.

Review matches

We defined the object returned from fastLink() as fl_output. It is of class list, and it actually contains several data frames within it, detailing the results of the matching. One of these data frames is matches, which contains the most likely matches across cases and results. You can access this “matches” data frame with fl_output$matches. Below, it is saved as my_matches for ease of accessing later.

When my_matches is printed, you see two column vectors: the pairs of row numbers/indices (also called “rownames”) in cases (“inds.a”) and in results (“inds.b”) representing the best matches. If a row number from a data frame is missing, then no match was found in the other data frame at the specified match threshold.

# print matches
my_matches <- fl_output$matches
my_matches
##   inds.a inds.b
## 1      1      1
## 2      2      2
## 3      3      3
## 4      4      4
## 5      8      8
## 6      7      9
## 7      6     10
## 8      5     12

Things to note:

  • Matches occurred despite slight differences in name spelling and dates of birth:
    • “Tony B Smith” matched to “Anthony B. Smith”
    • “Maria Rodriguez” matched to “Marialisa Rodrigues”
    • “Betty Chase” matched to “Elizabeth Chase”
    • “Olivier Laurent De Bordeaux” matched to “Oliver Laurent De Bordow” (missing date of birth ignored)
  • One row from cases (for “Blessing Adebayo”, row 9) had no good match in results, so it is not present in my_matches.

Join based on the probabilistic matches

To use these matches to join results to cases, one strategy is:

  1. Use left_join() to join my_matches to cases (matching rownames in cases to “inds.a” in my_matches)
  2. Then use another left_join() to join results to cases (matching the newly-acquired “inds.b” in cases to rownames in results)

Before the joins, we should clean the three data frames:

  • Both dfA and dfB should have their row numbers (“rowname”) converted to a proper column.
  • Both the columns in my_matches are converted to class character, so they can be joined to the character rownames
# Clean data prior to joining
#############################

# convert cases rownames to a column 
cases_clean <- cases %>% rownames_to_column()

# convert results rownames to a column
results_clean <- results %>% rownames_to_column()  

# convert all columns in matches dataset to character, so they can be joined to the rownames
matches_clean <- my_matches %>%
  mutate(across(everything(), as.character))



# Join matches to dfA, then add dfB
###################################
# column "inds.b" is added to dfA
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))

# column(s) from dfB are added 
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))

As performed using the code above, the resulting data frame complete will contain all columns from both cases and results. Many will be appended with suffixes “.x” and “.y”, because the column names would otherwise be duplicated.
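If the default “.x” and “.y” suffixes are hard to read, the suffix = argument of the dplyr join functions lets you choose your own. The sketch below uses two tiny made-up data frames (a and b are hypothetical, as are the suffixes "_cases" and "_results"), not the cases and results objects above.

```r
library(dplyr)

# Hypothetical data frames sharing the columns first and district
a <- tibble(
  rowname  = c("1", "2"),
  first    = c("Amir", "Anthony"),
  district = c("River", "River"))

b <- tibble(
  rowname  = c("1", "2"),
  first    = c("Amir", "Tony"),
  district = c("River", "River"),
  result   = c("positive", "positive"))

# Columns present in both inputs (other than the by column) receive the suffixes
joined <- left_join(a, b, by = "rowname", suffix = c("_cases", "_results"))

names(joined)
```

Columns that exist in only one of the data frames, such as result here, keep their original names.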

Alternatively, to achieve only the “original” 9 records in cases with the new column(s) from results, use select() on results before the joins, so that it contains only rownames and the columns that you want to add to cases (e.g. the column result).

cases_clean <- cases %>% rownames_to_column()

results_clean <- results %>%
  rownames_to_column() %>% 
  select(rowname, result)    # select only certain columns 

matches_clean <- my_matches %>%
  mutate(across(everything(), as.character))

# joins
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))

If you want to subset either dataset to only the rows that matched, you can use the code below:

cases_matched <- cases[my_matches$inds.a,]  # Rows in cases that matched to a row in results
results_matched <- results[my_matches$inds.b,]  # Rows in results that matched to a row in cases

Or, to see only the rows that did not match:

cases_not_matched <- cases[!rownames(cases) %in% my_matches$inds.a,]  # Rows in cases that did NOT match to a row in results
results_not_matched <- results[!rownames(results) %in% my_matches$inds.b,]  # Rows in results that did NOT match to a row in cases

Probabilistic deduplication

Probabilistic matching can be used to deduplicate a dataset as well. See the page on deduplication for other methods of deduplication.

Here we began with the cases dataset, but are now calling it cases_dup, as it has 2 additional rows that could be duplicates of previous rows: See “Tony” with “Anthony”, and “Marialisa Rodrigues” with “Maria Rodriguez”.
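The code that builds cases_dup is not shown here, so the following is only a sketch of how such a data frame could be assembled. The “Tony” row copies the values printed in the output at the end of this section. The “Maria Rodriguez” row is an assumption: its remaining fields are simply copied from the “Marialisa Rodrigues” record and may differ from the data actually used.

```r
library(dplyr)
library(tibble)

# The cases data frame, as defined earlier in this section
cases <- tribble(
  ~gender, ~first,      ~middle,     ~last,        ~yr,  ~mon, ~day, ~district,
  "M",     "Amir",      NA,          "Khan",       1989, 11,   22,   "River",
  "M",     "Anthony",   "B.",        "Smith",      1970, 9,    19,   "River",
  "F",     "Marialisa", "Contreras", "Rodrigues",  1972, 4,    15,   "River",
  "F",     "Elizabeth", "Casteel",   "Chase",      1954, 3,    3,    "City",
  "M",     "Jose",      "Sanchez",   "Lopez",      1996, 1,    6,    "City",
  "F",     "Cassidy",   "Jones",     "Davis",      1980, 7,    20,   "City",
  "M",     "Michael",   "Murphy",    "O'Calaghan", 1969, 4,    12,   "Rural",
  "M",     "Oliver",    "Laurent",   "De Bordow",  1971, 2,    4,    "River",
  "F",     "Blessing",  NA,          "Adebayo",    1955, 2,    14,   "Rural"
)

# Two possible duplicates to append
extra_rows <- tribble(
  ~gender, ~first,  ~middle,     ~last,       ~yr,  ~mon, ~day, ~district,
  "M",     "Tony",  "B.",        "Smith",     1970, 9,    19,   "River",   # values as printed in the output below
  "F",     "Maria", "Contreras", "Rodriguez", 1972, 4,    15,   "River"    # ASSUMED values, for illustration only
)

# Append the possible duplicates to the bottom of cases
cases_dup <- bind_rows(cases, extra_rows)

nrow(cases_dup)   # 11
```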

Run fastLink() like before, but compare the cases_dup data frame to itself. When the two data frames provided are identical, the function assumes you want to de-duplicate. Note we do not specify stringdist.match = or numeric.match = as we did previously.

## Run fastLink on the same dataset
dedupe_output <- fastLink(
  dfA = cases_dup,
  dfB = cases_dup,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district")
)
## 
## ==================== 
## fastLink(): Fast Probabilistic Record Linkage
## ==================== 
## 
## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
## dfA and dfB are identical, assuming deduplication of a single data set.
## Setting return.all to FALSE.
## 
## Calculating matches for each variable.
## Getting counts for parameter estimation.
##     Parallelizing calculation using OpenMP. 1 threads out of 12 are used.
## Running the EM algorithm.
## Getting the indices of estimated matches.
##     Parallelizing calculation using OpenMP. 1 threads out of 12 are used.
## Calculating the posterior for each pair of matched observations.
## Getting the match patterns for each estimated match.

Now, you can review the potential duplicates with getMatches(). Provide the data frame as both dfA = and dfB =, and provide the output of the fastLink() function as fl.out =. fl.out must be of class fastLink.dedupe, or in other words, the result of fastLink().

## Run getMatches()
cases_dedupe <- getMatches(
  dfA = cases_dup,
  dfB = cases_dup,
  fl.out = dedupe_output)

See the right-most column, which indicates the duplicate IDs - the final two rows are identified as being likely duplicates of rows 2 and 3.

To return the row numbers of rows which are likely duplicates, you can count the number of rows per unique value in the dedupe.ids column, and then filter to keep only those with more than one row. In this case this leaves rows 2 and 3.

cases_dedupe %>% 
  count(dedupe.ids) %>% 
  filter(n > 1)
##   dedupe.ids n
## 1          2 2
## 2          3 2

To inspect the whole rows of the likely duplicates, put the row number in this command:

# displays row 2 and all likely duplicates of it
cases_dedupe[cases_dedupe$dedupe.ids == 2,]   
##    gender   first middle  last   yr mon day district dedupe.ids
## 2       M Anthony     B. Smith 1970   9  19    River          2
## 10      M    Tony     B. Smith 1970   9  19    River          2

14.4 Binding and aligning

Another method of combining two data frames is “binding” them together. You can also think of this as “appending” or “adding” rows or columns.

This section will also discuss how to “align” the order of rows of one data frame to the order in another data frame. This topic is discussed below in the section on Binding columns.

Bind rows

To bind rows of one data frame to the bottom of another data frame, use bind_rows() from dplyr. It is very inclusive, so any column present in either data frame will be included in the output. A few notes:

  • Unlike the base R function rbind(), dplyr’s bind_rows() does not require both data frames to contain the same set of columns, and the order of columns does not need to be the same. As long as the column names are spelled identically, it will align them correctly.
  • You can optionally specify the argument .id =. Provide a character column name. This will produce a new column that serves to identify which data frame each row originally came from.
  • You can use bind_rows() on a list of similarly-structured data frames to combine them into one data frame. See an example in the Iteration, loops, and lists page involving the import of multiple linelists with purrr.
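The notes above can be sketched with two small made-up data frames (week1 and week2 are hypothetical names and values). The columns are deliberately in a different order, and .id = is used to record which data frame each row came from.

```r
library(dplyr)

# Hypothetical linelists with the same columns in a different order
week1 <- tibble(
  case_id  = c("a1", "a2"),
  hospital = c("Port Hospital", "Other"))

week2 <- tibble(
  hospital = "Central Hospital",
  case_id  = "b1")

# With a named list, the names become the values of the new identifier column
combined_weeks <- bind_rows(list(week1 = week1, week2 = week2), .id = "source")

combined_weeks
```

If the list is not named, the identifier column contains a numeric sequence instead.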

One common example of row binding is to bind a “total” row onto a descriptive table made with dplyr’s summarise() function. Below we create a table of case counts and median CT values by hospital with a total row.

The function summarise() is used on data grouped by hospital to return a summary data frame by hospital. But the function summarise() does not automatically produce a “totals” row, so we create it by summarising the data again, but with the data not grouped by hospital. This produces a second data frame of just one row. We can then bind these data frames together to achieve the final table.

See other worked examples like this in the Descriptive tables and Tables for presentation pages.

# Create core table
###################
hosp_summary <- linelist %>% 
  group_by(hospital) %>%                        # Group data by hospital
  summarise(                                    # Create new summary columns of indicators of interest
    cases = n(),                                  # Number of rows per hospital     
    ct_value_med = median(ct_blood, na.rm=T))     # median CT value per group

Here is the hosp_summary data frame:

Create a data frame with the “total” statistics (not grouped by hospital). This will return just one row.

# create totals
###############
totals <- linelist %>% 
  summarise(
    cases = n(),                               # Number of rows for whole dataset     
    ct_value_med = median(ct_blood, na.rm=T))  # Median CT for whole dataset

And below is that totals data frame. Note how there are only two columns. These columns are also in hosp_summary, but there is one column in hosp_summary that is not in totals (hospital).

Now we can bind the rows together with bind_rows().

# Bind data frames together
combined <- bind_rows(hosp_summary, totals)

Now we can view the result. See how in the final row, an empty NA value fills in for the column hospital that was not in totals. As explained in the Tables for presentation page, you could “fill-in” this cell with “Total” using replace_na().
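A minimal sketch of that last step, assuming the tidyr package is installed. The two small data frames below are hand-made stand-ins for hosp_summary and totals, with invented values rather than the real linelist results.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for hosp_summary and totals (values are invented)
hosp_summary <- tibble(
  hospital     = c("Central Hospital", "Port Hospital"),
  cases        = c(10L, 30L),
  ct_value_med = c(22, 21))

totals <- tibble(
  cases        = 40L,
  ct_value_med = 21)

combined <- bind_rows(hosp_summary, totals) %>%
  mutate(hospital = replace_na(hospital, "Total"))   # fill the NA created by the bind

combined
```

Note that replace_na() fills every NA in the column, so this approach assumes the only missing hospital value is the one in the totals row.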

Bind columns

There is a similar dplyr function bind_cols() which you can use to combine two data frames sideways. Note that rows are matched to each other by position (not like a join above) - for example the 12th row in each data frame will be aligned.

For an example, we bind several summary tables together. In order to do this, we also demonstrate how to re-arrange the order of rows in one data frame to match the order in another data frame, with match().

Here we define case_info as a summary data frame of linelist cases, by hospital, with the number of cases and the number of deaths.

# Case information
case_info <- linelist %>% 
  group_by(hospital) %>% 
  summarise(
    cases = n(),
    deaths = sum(outcome == "Death", na.rm=T)
  )

And let’s say that here is a different data frame contact_fu containing information on the percent of exposed contacts investigated and “followed-up”, again by hospital.

contact_fu <- data.frame(
  hospital = c("St. Mark's Maternity Hospital (SMMH)", "Military Hospital", "Missing", "Central Hospital", "Port Hospital", "Other"),
  investigated = c("80%", "82%", NA, "78%", "64%", "55%"),
  per_fu = c("60%", "25%", NA, "20%", "75%", "80%")
)

Note that the hospitals are the same, but are in different orders in each data frame. The easiest solution would be to use a left_join() on the hospital column, but you could also use bind_cols() with one extra step.
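For comparison, the left_join() route could look like the sketch below. To keep it self-contained, case_info is rebuilt by hand from the counts printed in the bind_cols() output of this section; in practice it would come from the summarise() command above.

```r
library(dplyr)

# case_info rebuilt by hand (normally created from linelist with group_by() and summarise())
case_info <- tibble(
  hospital = c("Central Hospital", "Military Hospital", "Missing", "Other",
               "Port Hospital", "St. Mark's Maternity Hospital (SMMH)"),
  cases    = c(454L, 896L, 1469L, 885L, 1762L, 422L),
  deaths   = c(193L, 399L, 611L, 395L, 785L, 199L))

contact_fu <- data.frame(
  hospital = c("St. Mark's Maternity Hospital (SMMH)", "Military Hospital", "Missing",
               "Central Hospital", "Port Hospital", "Other"),
  investigated = c("80%", "82%", NA, "78%", "64%", "55%"),
  per_fu = c("60%", "25%", NA, "20%", "75%", "80%"))

# Rows are matched on the hospital name, so the differing row order does not matter
case_contact <- left_join(case_info, contact_fu, by = "hospital")

case_contact
```

Because the join matches on values rather than position, it also avoids the duplicated hospital column that bind_cols() produces.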

Use match() to align ordering

Because the row orders are different, a simple bind_cols() command would result in a mis-match of data. To fix this we can use match() from base R to align the rows of a data frame in the same order as in another. We assume for this approach that there are no duplicate values in either data frame.

When we use match(), the syntax is match(TARGET ORDER VECTOR, DATA FRAME COLUMN TO CHANGE), where the first argument is the desired order (either a stand-alone vector, or in this case a column in a data frame), and the second argument is the column of the data frame that will be re-ordered. The output of match() is a vector of numbers giving, for each value in the target order, its row position in the data frame to be re-ordered. You can read more with ?match.

match(case_info$hospital, contact_fu$hospital)
## [1] 4 2 3 6 5 1

You can use this numeric vector to re-order the data frame - place it within subset brackets [ ] before the comma. Read more about base R bracket subset syntax in the R basics page. The command below creates a new data frame, defined as the old one with its rows re-ordered according to the numeric vector above.

contact_fu_aligned <- contact_fu[match(case_info$hospital, contact_fu$hospital),]
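The behaviour of match() may be easier to see on two short vectors. The sketch below uses made-up vectors named target and current, and only base R.

```r
target  <- c("b", "c", "a")   # the order we want
current <- c("a", "b", "c")   # the order we have

# For each element of target, the position at which it is found in current
idx <- match(target, current)
idx                            # 2 3 1

# Indexing current by these positions puts it in the target order
current[idx]                   # "b" "c" "a"

# A value that is absent from the second vector returns NA
match(c("a", "z"), current)    # 1 NA
```

An NA in the result is worth checking for before re-ordering a data frame, as it signals a value in the target order that has no counterpart in the column being re-ordered.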

Now we can bind the data frame columns together, with the correct row order, by using the re-ordered contact_fu_aligned. Note that some columns are duplicated and will require cleaning with rename(). Read more about bind_cols() here.

bind_cols(case_info, contact_fu_aligned)
## New names:
## * hospital -> hospital...1
## * hospital -> hospital...4
## # A tibble: 6 x 6
##   hospital...1                         cases deaths hospital...4                         investigated per_fu
##   <chr>                                <int>  <int> <chr>                                <chr>        <chr> 
## 1 Central Hospital                       454    193 Central Hospital                     78%          20%   
## 2 Military Hospital                      896    399 Military Hospital                    82%          25%   
## 3 Missing                               1469    611 Missing                              <NA>         <NA>  
## 4 Other                                  885    395 Other                                55%          80%   
## 5 Port Hospital                         1762    785 Port Hospital                        64%          75%   
## 6 St. Mark's Maternity Hospital (SMMH)   422    199 St. Mark's Maternity Hospital (SMMH) 80%          60%

A base R alternative to bind_cols() is cbind(), which likewise combines data frames side by side, matching rows by position.

14.5 Resources

The tidyverse page on joins

The R for Data Science page on relational data

The tidyverse page on binding with dplyr

A vignette on fastLink at the package’s Github page

Publication describing methodology of fastLink

Publication describing RecordLinkage package

15 De-duplication

This page covers the following de-duplication techniques:

  1. Identifying and removing duplicate rows
  2. “Slicing” rows to keep only certain rows (e.g. min or max) from each group of rows
  3. “Rolling-up”, or combining values from multiple rows into one row

15.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  tidyverse,   # deduplication, grouping, and slicing functions
  janitor,     # function for reviewing duplicates
  stringr)      # for string searches, can be used in "rolling-up" values

Import data

For demonstration, we will use an example dataset that is created with the R code below.

The data are records of COVID-19 phone encounters, including encounters with contacts and with cases. The columns include recordID (computer-generated), personID, name, date of encounter, time of encounter, the purpose of the encounter (either to interview as a case or as a contact), and symptoms_ever (whether the person in that encounter reported ever having symptoms).

Here is the code to create the obs dataset:

obs <- data.frame(
  recordID  = c(1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18),
  personID  = c(1,1,2,2,3,2,4,5,6,7,2,1,3,3,4,5,5,7,8),
  name      = c("adam", "adam", "amrish", "amrish", "mariah", "amrish", "nikhil", "brian", "smita", "raquel", "amrish",
                "adam", "mariah", "mariah", "nikhil", "brian", "brian", "raquel", "natalie"),
  date      = c("1/1/2020", "1/1/2020", "2/1/2020", "2/1/2020", "5/1/2020", "5/1/2020", "5/1/2020", "5/1/2020", "5/1/2020","5/1/2020", "2/1/2020",
                "5/1/2020", "6/1/2020", "6/1/2020", "6/1/2020", "6/1/2020", "7/1/2020", "7/1/2020", "7/1/2020"),
  time      = c("09:00", "09:00", "14:20", "14:20", "12:00", "16:10", "13:01", "15:20", "14:20", "12:30", "10:24",
                "09:40", "07:25", "08:32", "15:36", "15:31", "07:59", "11:13", "17:12"),
  encounter = c(1,1,1,1,1,3,1,1,1,1,2,
                2,2,3,2,2,3,2,1),
  purpose   = c("contact", "contact", "contact", "contact", "case", "case", "contact", "contact", "contact", "contact", "contact",
                "case", "contact", "contact", "contact", "contact", "case", "contact", "case"),
  symptoms_ever = c(NA, NA, "No", "No", "No", "Yes", "Yes", "No", "Yes", NA, "Yes",
                    "No", "No", "No", "Yes", "Yes", "No","No", "No")) %>% 
  mutate(date = as.Date(date, format = "%d/%m/%Y"))

Here is the data frame

Use the filter boxes along the top to review the encounters for each person.

A few things to note as you review the data:

  • The first two records are 100% complete duplicates including duplicate recordID (must be a computer glitch!)
  • The second two rows are duplicates, in all columns except for recordID
  • Several people had multiple phone encounters, at various dates and times, and as contacts and/or cases
  • At each encounter, the person was asked if they had ever had symptoms, and some of this information is missing.

And here is a quick summary of the people and the purposes of their encounters, using tabyl() from janitor:

obs %>% 
  tabyl(name, purpose)
##     name case contact
##     adam    1       2
##   amrish    1       3
##    brian    1       2
##   mariah    1       2
##  natalie    1       0
##   nikhil    0       2
##   raquel    0       2
##    smita    0       1

15.2 Deduplication

This section describes how to review and remove duplicate rows in a data frame. It also shows how to handle duplicate elements in a vector.

Examine duplicate rows

To quickly review rows that have duplicates, you can use get_dupes() from the janitor package. By default, all columns are considered when duplicates are evaluated - rows returned by the function are 100% duplicates considering the values in all columns.

In the obs data frame, the first two rows are 100% duplicates - they have the same value in every column (including the recordID column, which is supposed to be unique - it must be some computer glitch). The returned data frame automatically includes a new column dupe_count on the right side, showing the number of rows with that combination of duplicate values.

# 100% duplicates across all columns
obs %>% 
  janitor::get_dupes()

See the original data

However, if we choose to ignore recordID, the 3rd and 4th rows are also duplicates of each other. That is, they have the same values in all columns except for recordID. You can specify columns to be ignored in the function using a - minus symbol.

# Duplicates when column recordID is not considered
obs %>% 
  janitor::get_dupes(-recordID)         # if multiple columns, wrap them in c()

You can also positively specify the columns to consider. Below, only rows that have the same values in the name and purpose columns are returned. Notice how “amrish” now has dupe_count equal to 3 to reflect his three “contact” encounters.

Scroll left for more rows

# duplicates based on name and purpose columns ONLY
obs %>% 
  janitor::get_dupes(name, purpose)

See the original data.

See ?get_dupes for more details, or see this online reference

Keep only unique rows

To keep only unique rows of a data frame, use distinct() from dplyr (as demonstrated in the Cleaning data and core functions page). Rows that are duplicates are removed such that only the first of such rows is kept. By default, “first” means the row that appears earliest in the data frame (order of rows top-to-bottom). Only unique rows remain.

In the example below, we run distinct() such that the column recordID is excluded from consideration - thus two duplicate rows are removed. The second row (for “adam”) was a 100% duplicate of the first and has been removed. Row 4 (for “amrish”) duplicated row 3 in every column except recordID (which is not being considered) and so is also removed. The returned data frame has nrow(obs) - 2 rows (17 instead of 19).

Scroll to the left to see the entire data frame

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(across(-recordID), # reduces data frame to only unique rows (keeps first one of any duplicates)
           .keep_all = TRUE) 

# if outside pipes, include the data as first argument 
# distinct(obs)

CAUTION: If using distinct() on grouped data, the function will apply to each group.

Deduplicate based on specific columns

You can also specify columns to be the basis for de-duplication. In this way, the de-duplication only applies to rows that are duplicates within the specified columns. Unless you set .keep_all = TRUE, all columns not mentioned will be dropped.

In the example below, the de-duplication only applies to rows that have identical values for name and purpose columns. Thus, “brian” has only 2 rows instead of 3 - his first “contact” encounter and his only “case” encounter. To adjust so that brian’s latest encounter of each purpose is kept, see the Slice with groups section below.

Scroll to the left to see the entire data frame

# added to a chain of pipes (e.g. data cleaning)
obs %>% 
  distinct(name, purpose, .keep_all = TRUE) %>%  # keep rows unique by name and purpose, retain all columns
  arrange(name)                                  # arrange for easier viewing

See the original data.

Deduplicate elements in a vector

The function duplicated() from base R will evaluate a vector (column) and return a logical vector of the same length (TRUE/FALSE). The first time a value appears, it will return FALSE (not a duplicate), and subsequent times that value appears it will return TRUE. Note how NA is treated the same as any other value.

x <- c(1, 1, 2, NA, NA, 4, 5, 4, 4, 1, 2)
duplicated(x)
##  [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

To return only the duplicated elements, you can use brackets to subset the original vector:

x[duplicated(x)]
## [1]  1 NA  4  4  1  2

To return only the unique elements, use unique() from base R. To remove NAs from the output, nest na.omit() within unique().

unique(x)           # alternatively, use x[!duplicated(x)]
## [1]  1  2 NA  4  5
unique(na.omit(x))  # remove NAs 
## [1] 1 2 4 5
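Note that duplicated() does not flag the first occurrence of a repeated value. If you want to flag every element that has a duplicate anywhere in the vector, one option (a small sketch using the fromLast = argument of duplicated()) is to combine a forward and a backward pass:

```r
x <- c(1, 1, 2, NA, NA, 4, 5, 4, 4, 1, 2)

# TRUE for every element whose value occurs more than once (first occurrences included)
all_dupes <- duplicated(x) | duplicated(x, fromLast = TRUE)

x[all_dupes]    # every element that has a duplicate somewhere in x
x[!all_dupes]   # elements that occur exactly once (here, only 5)
```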

Using base R

To return duplicate rows

In base R, you can also see which rows are 100% duplicates in a data frame df with the command duplicated(df) (returns a logical vector of the rows).

Thus, you can also use the base subset [ ] on the data frame to see the duplicated rows with df[duplicated(df),] (don’t forget the comma, meaning that you want to see all columns!).

To return unique rows

See the notes above. To see the unique rows you add the logical negator ! in front of the duplicated() function:
df[!duplicated(df),]

To return rows that are duplicates of only certain columns

Subset the df that is within the duplicated() parentheses, so this function will operate on only certain columns of the df.

To specify the columns, provide column numbers or names after a comma (remember, all this is within the duplicated() function).

Be sure to keep the comma , outside after the duplicated() function as well!

For example, to evaluate only columns 2 through 5 for duplicates: df[!duplicated(df[, 2:5]),]
To evaluate only columns name and purpose for duplicates: df[!duplicated(df[, c("name", "purpose")]),]
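The base R commands above can be tried on a small data frame. The df below is a hypothetical five-row example that mimics the first rows of obs; it is a sketch for practice, not part of the handbook data.

```r
df <- data.frame(
  recordID = c(1, 1, 2, 3, 4),
  name     = c("adam", "adam", "amrish", "amrish", "mariah"),
  purpose  = c("contact", "contact", "contact", "contact", "case"))

# 100% duplicates: only row 2 repeats an earlier row in every column
df[duplicated(df), ]

# judge duplicates on name and purpose only: rows 2 and 4 are dropped
df_unique <- df[!duplicated(df[, c("name", "purpose")]), ]
df_unique
```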

15.3 Slicing

To “slice” a data frame is to filter its rows by row number/position. This becomes particularly useful if you have multiple rows per functional group (e.g. per “person”) and you only want to keep one or some of them.

The basic slice() function accepts numbers and returns rows in those positions. If the numbers provided are positive, only those rows are returned. If negative, those rows are dropped and all others are returned. Numbers must be either all positive or all negative.

obs %>% slice(4)  # return the 4th row
##   recordID personID   name       date  time encounter purpose symptoms_ever
## 1        3        2 amrish 2020-01-02 14:20         1 contact            No
obs %>% slice(c(2,4))  # return rows 2 and 4
##   recordID personID   name       date  time encounter purpose symptoms_ever
## 1        1        1   adam 2020-01-01 09:00         1 contact          <NA>
## 2        3        2 amrish 2020-01-02 14:20         1 contact            No
#obs %>% slice(c(2:4))  # return rows 2 through 4

See the original data.
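The examples above use positive numbers only. Here is a minimal sketch of negative positions and of n() inside slice(), using a hypothetical data frame called toy:

```r
library(dplyr)

toy <- data.frame(id = 1:5, value = c("a", "b", "c", "d", "e"))

toy %>% slice(-1)         # drop the first row, keep rows 2 to 5
toy %>% slice(-c(2, 4))   # drop rows 2 and 4, keep rows 1, 3 and 5
toy %>% slice(n())        # n() is the number of rows, so this returns the last row
```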

There are several variations. Each accepts a number of rows to return (to n =); slice_min() and slice_max() must also be provided with a column to order by.

  • slice_min() and slice_max() keep only the row(s) with the minimum or maximum value(s) of the specified column. This also works to return the “min” and “max” of ordered factors.
  • slice_head() and slice_tail() - keep only the first or last row(s).
  • slice_sample() - keep only a random sample of the rows.
obs %>% slice_max(encounter, n = 1)  # return rows with the largest encounter number
##   recordID personID   name       date  time encounter purpose symptoms_ever
## 1        5        2 amrish 2020-01-05 16:10         3    case           Yes
## 2       13        3 mariah 2020-01-06 08:32         3 contact            No
## 3       16        5  brian 2020-01-07 07:59         3    case            No

Use arguments n = or prop = to specify the number or proportion of rows to keep. If not using the function in a pipe chain, provide the data argument first (e.g. slice_head(data, n = 2)). See ?slice for more information.

Other arguments:

order_by = used in slice_min() and slice_max(); this is the column to order by before slicing.
with_ties = TRUE by default, meaning ties are kept.
.preserve = FALSE by default, meaning the grouping structure is re-calculated after slicing. If TRUE, the existing grouping structure is kept as is.
weight_by = used in slice_sample(); an optional numeric column to weight by (bigger number more likely to get sampled). Also replace = for whether sampling is done with/without replacement.

TIP: When using slice_max() and slice_min(), be sure to specify/write the n = (e.g. n = 2, not just 2). Otherwise you may get an error such as Error: `...` is not empty.

NOTE: You may encounter the function top_n(), which has been superseded by the slice functions.
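To see how these arguments behave, here is a short sketch on a hypothetical ten-row data frame (toy and its columns are made up for illustration):

```r
library(dplyr)

toy <- data.frame(id = 1:10, score = c(5, 3, 9, 9, 1, 7, 2, 8, 6, 4))

toy %>% slice_head(n = 3)                            # first 3 rows
toy %>% slice_tail(prop = 0.5)                       # last half of the rows (5 of 10)
toy %>% slice_max(score, n = 1)                      # ties kept: both rows with score 9
toy %>% slice_max(score, n = 1, with_ties = FALSE)   # only one of the tied rows

set.seed(1)                                          # make the random sample reproducible
toy %>% slice_sample(n = 4)                          # 4 random rows, without replacement
```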

Slice with groups

The slice_*() functions can be very useful if applied to a grouped data frame because the slice operation is performed on each group separately. Use the function group_by() in conjunction with slice() to group the data to take a slice from each group.

This is helpful for de-duplication if you have multiple rows per person but only want to keep one of them. You first use group_by() with key columns that are the same per person, and then use a slice function on a column that will differ among the grouped rows.

In the example below, to keep only the latest encounter per person, we group the rows by name and then use slice_max() with n = 1 on the date column. Be aware! To apply a function like slice_max() on dates, the date column must be class Date.

By default, “ties” (e.g. same date in this scenario) are kept, and we would still get multiple rows for some people (e.g. mariah, who has two encounters on her latest date). To avoid this we set with_ties = FALSE. We get back only one row per person.

CAUTION: If using arrange(), specify .by_group = TRUE to have the data arranged within each group.

DANGER: If with_ties = FALSE, the first row of a tie is kept. This may be deceptive. Mariah, for example, has two encounters on her latest date (6 Jan) and the first (earliest) one was kept. Likely, we want to keep her later encounter on that day. See how to “break” these ties in the next example.

obs %>% 
  group_by(name) %>%       # group the rows by 'name'
  slice_max(date,          # keep row per group with maximum date value 
            n = 1,         # keep only the single highest row 
            with_ties = FALSE) # if there's a tie (of date), take the first row

Above, for example we can see that only Amrish’s row on 5 Jan was kept, and only Brian’s row on 7 Jan was kept. See the original data.

Breaking “ties”

Multiple slice statements can be run to “break ties”. In this case, if a person has multiple encounters on their latest date, the encounter with the latest time is kept (lubridate::hm() is used to convert the character times to a sortable time class).
Note how now, the one row kept for “Mariah” on 6 Jan is encounter 3 from 08:32, not encounter 2 at 07:25.

# Example of multiple slice statements to "break ties"
obs %>%
  group_by(name) %>%
  
  # FIRST - slice by latest date
  slice_max(date, n = 1, with_ties = TRUE) %>% 
  
  # SECOND - if there is a tie, select row with latest time; ties prohibited
  slice_max(lubridate::hm(time), n = 1, with_ties = FALSE)

In the example above, it would also have been possible to slice by encounter number, but we showed the slice on date and time for example purposes.
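Another way to break ties, shown here only as a sketch on hypothetical data (the enc data frame is made up), is to sort the rows first and then take the last row per group. This relies on the times being zero-padded "HH:MM" text, which sorts in chronological order.

```r
library(dplyr)

# toy encounters (hypothetical values)
enc <- data.frame(
  name = c("mariah", "mariah", "mariah", "brian", "brian"),
  date = as.Date(c("2020-01-05", "2020-01-06", "2020-01-06", "2020-01-06", "2020-01-07")),
  time = c("12:00", "07:25", "08:32", "15:31", "07:59"))

latest <- enc %>%
  arrange(date, time) %>%     # oldest to newest; time breaks ties within a date
  group_by(name) %>%
  slice_tail(n = 1) %>%       # last (most recent) row of each group
  ungroup()

latest
```

If the times were not zero-padded (e.g. "7:25"), convert them to a time class before sorting, as done with lubridate::hm() above.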

TIP: To use slice_max() or slice_min() on a “character” column, mutate it to an ordered factor class!

See the original data.

Keep all but mark them

If you want to keep all records but mark only some for analysis, consider a two-step approach utilizing a unique recordID/encounter number:

  1. Reduce/slice the original data frame to only the rows for analysis. Save/retain this reduced data frame.
  2. In the original data frame, mark rows as appropriate with case_when(), based on whether their record unique identifier (recordID in this example) is present in the reduced data frame.
# 1. Define data frame of rows to keep for analysis
obs_keep <- obs %>%
  group_by(name) %>%
  slice_max(encounter, n = 1, with_ties = FALSE) # keep only latest encounter per person


# 2. Mark original data frame
obs_marked <- obs %>%

  # make new dup_record column
  mutate(dup_record = case_when(
    
    # if record is in obs_keep data frame
    recordID %in% obs_keep$recordID ~ "For analysis", 
    
    # all else marked as "Ignore" for analysis purposes
    TRUE                            ~ "Ignore"))

# print
obs_marked
##    recordID personID    name       date  time encounter purpose symptoms_ever   dup_record
## 1         1        1    adam 2020-01-01 09:00         1 contact          <NA>       Ignore
## 2         1        1    adam 2020-01-01 09:00         1 contact          <NA>       Ignore
## 3         2        2  amrish 2020-01-02 14:20         1 contact            No       Ignore
## 4         3        2  amrish 2020-01-02 14:20         1 contact            No       Ignore
## 5         4        3  mariah 2020-01-05 12:00         1    case            No       Ignore
## 6         5        2  amrish 2020-01-05 16:10         3    case           Yes For analysis
## 7         6        4  nikhil 2020-01-05 13:01         1 contact           Yes       Ignore
## 8         7        5   brian 2020-01-05 15:20         1 contact            No       Ignore
## 9         8        6   smita 2020-01-05 14:20         1 contact           Yes For analysis
## 10        9        7  raquel 2020-01-05 12:30         1 contact          <NA>       Ignore
## 11       10        2  amrish 2020-01-02 10:24         2 contact           Yes       Ignore
## 12       11        1    adam 2020-01-05 09:40         2    case            No For analysis
## 13       12        3  mariah 2020-01-06 07:25         2 contact            No       Ignore
## 14       13        3  mariah 2020-01-06 08:32         3 contact            No For analysis
## 15       14        4  nikhil 2020-01-06 15:36         2 contact           Yes For analysis
## 16       15        5   brian 2020-01-06 15:31         2 contact           Yes       Ignore
## 17       16        5   brian 2020-01-07 07:59         3    case            No For analysis
## 18       17        7  raquel 2020-01-07 11:13         2 contact            No For analysis
## 19       18        8 natalie 2020-01-07 17:12         1    case            No For analysis

See the original data.

Calculate row completeness

Create a column that contains a metric for the row’s completeness (non-missingness). This could be helpful when deciding which rows to prioritize over others when de-duplicating/slicing.

In this example, “key” columns over which you want to measure completeness are saved in a vector of column names.

Then the new column key_completeness is created with mutate(). The new value in each row is defined as a calculated fraction: the number of non-missing values in that row among the key columns, divided by the number of key columns.

This involves the function rowSums() from base R. Also used is ., which within piping refers to the data frame at that point in the pipe (in this case, it is being subset with brackets []).

Scroll to the right to see more rows

# create a "key variable completeness" column
# this is a *proportion* of the columns designated as "key_cols" that have non-missing values

key_cols = c("personID", "name", "symptoms_ever")

obs %>% 
  mutate(key_completeness = rowSums(!is.na(.[,key_cols]))/length(key_cols)) 

See the original data.
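To check the arithmetic by hand, here is the same calculation in base R on a hypothetical three-row data frame (toy and its values are made up). The notes column is not listed in key_cols, so it does not affect the result.

```r
toy <- data.frame(
  personID      = c(1, 2, NA),
  name          = c("adam", NA, NA),
  symptoms_ever = c("No", "Yes", NA),
  notes         = c(NA, "called twice", NA))   # not a key column, so ignored

key_cols <- c("personID", "name", "symptoms_ever")

# count non-missing key values per row, then divide by the number of key columns
toy$key_completeness <- rowSums(!is.na(toy[, key_cols])) / length(key_cols)

toy$key_completeness   # 1, 2/3 and 0 for the three rows
```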

15.4 Roll-up values

This section describes:

  1. How to “roll-up” values from multiple rows into just one row, with some variations
  2. Once you have “rolled-up” values, how to overwrite/prioritize the values in each cell

This section uses the example dataset from the Preparation section.

Roll-up values into one row

The code example below uses group_by() and summarise() to group rows by person, and then paste together all values within the grouped rows. Thus, you get one summary row per person. A few notes:

  • A suffix can be appended to all new columns ("_roll" in the last variation below)
  • If you want to show only unique values per cell, then wrap the na.omit() with unique()
  • na.omit() removes NA values, but if this is not desired it can be removed, leaving paste0(.x, collapse = "; ")
# "Roll-up" values into one row per group (per "personID") 
cases_rolled <- obs %>% 
  
  # create groups by personID
  group_by(personID) %>% 
  
  # order the rows within each group (e.g. by date)
  arrange(date, .by_group = TRUE) %>% 
  
  # For each column, paste together all values within the grouped rows, separated by ";"
  summarise(
    across(everything(),                           # apply to all columns
           ~paste0(na.omit(.x), collapse = "; "))) # function is defined which combines non-NA values

The result is one row per group (ID), with entries arranged by date and pasted together. Scroll to the left to see more rows

See the original data.

This variation shows unique values only:

# Variation - show unique values only 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                                   # apply to all columns
           ~paste0(unique(na.omit(.x)), collapse = "; "))) # function is defined which combines unique non-NA values
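To see what the function inside across() does to a single column within a single group, here is a sketch on a hypothetical vector of answers (vals is made up for illustration):

```r
vals <- c("No", NA, "Yes", "No")   # hypothetical answers across one person's rows

all_vals    <- paste0(vals, collapse = "; ")                    # "No; NA; Yes; No" (NA pasted as text)
no_missing  <- paste0(na.omit(vals), collapse = "; ")           # "No; Yes; No"
unique_vals <- paste0(unique(na.omit(vals)), collapse = "; ")   # "No; Yes"
```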

This variation appends a suffix to each column.
In this case "_roll" to signify that it has been rolled:

# Variation - suffix added to column names 
cases_rolled <- obs %>% 
  group_by(personID) %>% 
  arrange(date, .by_group = TRUE) %>% 
  summarise(
    across(everything(),                
           list(roll = ~paste0(na.omit(.x), collapse = "; ")))) # _roll is appended to column names

Overwrite values/hierarchy

If you then want to evaluate all of the rolled values, and keep only a specific value (e.g. “best” or “maximum” value), you can use mutate() across the desired columns, to implement case_when(), which uses str_detect() from the stringr package to sequentially look for string patterns and overwrite the cell content.

# CLEAN CASES
#############
cases_clean <- cases_rolled %>% 
    
    # clean Yes-No-Unknown vars: replace text with "highest" value present in the string
    mutate(across(c(contains("symptoms_ever")),                     # operates on specified columns (Y/N/U)
             list(mod = ~case_when(                                 # adds suffix "_mod" to new cols; implements case_when()
               
               str_detect(.x, "Yes")       ~ "Yes",                 # if "Yes" is detected, then cell value converts to yes
               str_detect(.x, "No")        ~ "No",                  # then, if "No" is detected, then cell value converts to no
               str_detect(.x, "Unknown")   ~ "Unknown",             # then, if "Unknown" is detected, then cell value converts to Unknown
               TRUE                        ~ as.character(.x)))),   # otherwise, the value is kept as is
      .keep = "unused")                                             # old columns removed, leaving only _mod columns

Now you can see in the new symptoms_ever column (the one with the suffix "_mod") that if the person EVER said “Yes” to symptoms, then only “Yes” is displayed.

See the original data.
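The order of the case_when() conditions is what creates the hierarchy: the first condition that is TRUE wins. A minimal sketch on hypothetical rolled-up strings (the rolled vector is made up):

```r
library(dplyr)
library(stringr)

rolled <- c("No; Yes; No", "No; No", "Unknown; No", "")   # hypothetical rolled-up cells

prioritized <- case_when(
  str_detect(rolled, "Yes")     ~ "Yes",       # checked first, so "Yes" wins over everything
  str_detect(rolled, "No")      ~ "No",        # only reached if no "Yes" was found
  str_detect(rolled, "Unknown") ~ "Unknown",   # only reached if neither "Yes" nor "No" was found
  TRUE                          ~ rolled)      # anything else is kept as is

prioritized
```

Note that str_detect() is case-sensitive by default, so "No" does not match the lowercase letters "no" inside "Unknown".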

15.5 Probabilistic de-duplication

Sometimes, you may want to identify “likely” duplicates based on similarity (e.g. string “distance”) across several columns such as name, age, sex, date of birth, etc. You can apply a probabilistic matching algorithm to identify likely duplicates.

See the page on Joining data for an explanation on this method. The section on Probabilistic Matching contains an example of applying these algorithms to compare a data frame to itself, thus performing probabilistic de-duplication.

15.6 Resources

Much of the information in this page is adapted from these resources and vignettes online:

datanovia

dplyr tidyverse reference

cran janitor vignette

16 Iteration, loops, and lists

Epidemiologists often are faced with repeating analyses on subgroups such as countries, districts, or age groups. These are but a few of the many situations involving iteration. Coding your iterative operations using the approaches below will help you perform such repetitive tasks faster, reduce the chance of error, and reduce code length.

This page will introduce two approaches to iterative operations - using for loops and using the package purrr.

  1. for loops iterate code across a series of inputs, but are less common in R than in other programming languages. Nevertheless, we introduce them here as a learning tool and reference.
  2. The purrr package is the tidyverse approach to iterative operations - it works by “mapping” a function across many inputs (values, columns, datasets, etc.)

Along the way, we’ll show examples like:

  • Importing and exporting multiple files
  • Creating epicurves for multiple jurisdictions
  • Running T-tests for several columns in a data frame

In the purrr section we will also provide several examples of creating and handling lists.

16.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
     rio,         # import/export
     here,        # file locator
     purrr,       # iteration
     tidyverse    # data management and visualization
)

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

16.2 for loops

for loops in R

for loops are not emphasized in R, but are common in other programming languages. For beginners, they can be helpful to learn and practice with because it is easier to “explore” and “de-bug” them, and to grasp exactly what is happening in each iteration, especially when you are not yet comfortable writing your own functions.

You may move quickly through for loops to iterating with mapped functions with purrr (see section below).

Core components

A for loop has three core parts:

  1. The sequence of items to iterate through
  2. The operations to conduct per item in the sequence
  3. The container for the results (optional)

The basic syntax is: for (item in sequence) {do operations using item}. Note the parentheses and the curly brackets. The results could be printed to console, or stored in a container R object.

A simple for loop example is below.

for (num in c(1,2,3,4,5)) {  # the SEQUENCE is defined (numbers 1 to 5) and loop is opened with "{"
  print(num + 2)             # The OPERATIONS (add two to each sequence number and print)
}                            # The loop is closed with "}"                            
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
                             # There is no "container" in this example

Sequence

This is the “for” part of a for loop - the operations will run “for” each item in the sequence. The sequence can be a series of values (e.g. names of jurisdictions, diseases, column names, list elements, etc), or it can be a series of consecutive numbers (e.g. 1,2,3,4,5). Each approach has its own uses, described below.

The basic structure of a sequence statement is item in vector.

  • You can write any character or word in place of “item” (e.g. “i”, “num”, “hosp”, “district”, etc.). The value of this “item” changes with each iteration of the loop, proceeding through each value in the vector.
  • The vector could be of character values, column names, or perhaps a sequence of numbers - these are the values that will change with each iteration. You can use them within the for loop operations using the “item” term.

Example: sequence of character values

In this example, a loop is performed for each value in a pre-defined character vector of hospital names.

# make vector of the hospital names
hospital_names <- unique(linelist$hospital)
hospital_names # print
## [1] "Other"                                "Missing"                              "St. Mark's Maternity Hospital (SMMH)" "Port Hospital"                       
## [5] "Military Hospital"                    "Central Hospital"

We have chosen the term hosp to represent values from the vector hospital_names. For the first iteration of the loop, the value of hosp will be hospital_names[[1]]. For the second loop it will be hospital_names[[2]]. And so on…

# a 'for loop' with character sequence

for (hosp in hospital_names){       # sequence
  
       # OPERATIONS HERE
  }
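To make the skeleton above concrete, here is a small sketch with operations filled in. It uses a hypothetical toy_linelist (made-up values) so that it runs without the handbook data:

```r
# toy linelist (hypothetical values) so the example runs on its own
toy_linelist <- data.frame(
  hospital = c("Port Hospital", "Central Hospital", "Port Hospital", "Other"),
  age      = c(12, 40, 33, 27))

hospital_names <- unique(toy_linelist$hospital)

for (hosp in hospital_names) {
  n_cases <- sum(toy_linelist$hospital == hosp)   # rows belonging to the current hospital
  print(paste0(hosp, ": ", n_cases, " case(s)"))
}
```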

Example: sequence of column names

This is a variation on the character sequence above, in which the names of an existing R object are extracted and become the vector. For example, the column names of a data frame. Conveniently, in the operations code of the for loop, the column names can be used to index (subset) their original data frame.

Below, the sequence is the names() (column names) of the linelist data frame. Our “item” name is col, which will represent each column name as the loop proceeds.

For purposes of example, we include operations code inside the for loop, which is run for every value in the sequence. In this code, the sequence values (column names) are used to index (subset) linelist, one-at-a-time. As taught in the R basics page, double brackets [[ ]] are used to subset. The resulting column is passed to is.na(), then to sum() to produce the number of values in the column that are missing. The result is printed to the console - one number for each column.

A note on indexing with column names - whenever referencing the column itself do not just write “col”! col represents just the character column name! To refer to the entire column you must use the column name as an index on linelist via linelist[[col]].

for (col in names(linelist)){        # loop runs for each column in linelist; column name represented by "col" 
  
  # Example operations code - print number of missing values in column
  print(sum(is.na(linelist[[col]])))  # linelist is indexed by current value of "col"
     
}
## [1] 0
## [1] 0
## [1] 2087
## [1] 256
## [1] 0
## [1] 936
## [1] 1323
## [1] 278
## [1] 86
## [1] 0
## [1] 86
## [1] 86
## [1] 86
## [1] 0
## [1] 0
## [1] 0
## [1] 2088
## [1] 2088
## [1] 0
## [1] 0
## [1] 0
## [1] 249
## [1] 249
## [1] 249
## [1] 249
## [1] 249
## [1] 149
## [1] 765
## [1] 0
## [1] 256

Sequence of numbers

In this approach, the sequence is a series of consecutive numbers. Thus, the value of the “item” is not a character value (e.g. “Central Hospital” or “date_onset”) but is a number. This is useful for looping through data frames, as you can use the “item” number inside the for loop to index the data frame by row number.

For example, let’s say that you want to loop through every row in your data frame and extract certain information. Your “items” would be numeric row numbers. Often, “items” in this case are written as i.

The for loop process could be explained in words as “for every item in a sequence of numbers from 1 to the total number of rows in my data frame, do X”. For the first iteration of the loop, the value of “item” i would be 1. For the second iteration, i would be 2, etc.

Here is what the sequence looks like in code: for (i in 1:nrow(linelist)) {OPERATIONS CODE} where i represents the “item” and 1:nrow(linelist) produces a sequence of consecutive numbers from 1 through the number of rows in linelist.

for (i in 1:nrow(linelist)) {  # use on a data frame
  # OPERATIONS HERE
}  
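One caution with 1:nrow(): if the data frame has zero rows, 1:0 counts backwards and the loop still runs. A safer habit is seq_len(nrow()), as this small sketch with a hypothetical empty data frame shows:

```r
empty_df <- data.frame(id = integer(0))   # a data frame with zero rows

1:nrow(empty_df)          # returns 1 0, so a loop over this would run twice (i = 1, then i = 0)
seq_len(nrow(empty_df))   # returns integer(0), so a loop over this runs zero times

runs <- 0
for (i in seq_len(nrow(empty_df))) {
  runs <- runs + 1        # never reached, because the sequence is empty
}
runs
```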

If you want the sequence to be numbers, but you are starting from a vector (not a data frame), use the shortcut seq_along() to return a sequence of numbers for each element in the vector. For example, for (i in seq_along(hospital_names)) {OPERATIONS CODE}.

The below code actually returns numbers, which would become the value of i in their respective loop.

seq_along(hospital_names)  # use on a vector
## [1] 1 2 3 4 5 6

One advantage of using numbers in the sequence is that it is easy to also use the i number to index a container that stores the loop outputs. There is an example of this in the Operations section below.

Operations

This is code within the curly brackets { } of the for loop. You want this code to run for each “item” in the sequence. Therefore, be careful that every part of your code that changes by the “item” is correctly coded such that it actually changes! E.g. remember to use [[ ]] for indexing.

In the example below, we iterate through each row in the linelist. The gender and age values of each row are pasted together and stored in the container character vector cases_demographics. Note how we also use indexing [[i]] to save the loop output to the correct position in the “container” vector.

# create container to store results - a character vector
cases_demographics <- vector(mode = "character", length = nrow(linelist))

# the for loop
for (i in 1:nrow(linelist)){
  
  # OPERATIONS
  # extract values from linelist for row i, using brackets for indexing
  row_gender  <- linelist$gender[[i]]
  row_age     <- linelist$age_years[[i]]    # don't forget to index!
     
  # combine gender-age and store in container vector at indexed location
  cases_demographics[[i]] <- str_c(row_gender, row_age, sep = ",") 

}  # end for loop


# display first 10 rows of container
head(cases_demographics, 10)
##  [1] "m,2"  "f,3"  "m,56" "f,18" "m,3"  "f,16" "f,16" "f,0"  "m,61" "f,27"

Container

Sometimes the results of your for loop will be printed to the console or RStudio Plots pane. Other times, you will want to store the outputs in a “container” for later use. Such a container could be a vector, a data frame, or even a list.

It is most efficient to create the container for the results before even beginning the for loop. In practice, this means creating an empty vector, data frame, or list. These can be created with the functions vector() for vectors or lists, or with matrix() and data.frame() for a data frame.

Empty vector

Use vector() and specify the mode = based on the expected class of the objects you will insert - either “double” (to hold numbers), “character”, or “logical”. You should also set the length = in advance. This should be the length of your for loop sequence.

Say you want to store the median delay-to-admission for each hospital. You would use “double” and set the length to be the number of expected outputs (the number of unique hospitals in the data set).

delays <- vector(
  mode = "double",                            # we expect to store numbers
  length = length(unique(linelist$hospital))) # the number of unique hospitals in the dataset
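
As a minimal sketch of how such a container might then be filled, the loop below stores one median per hospital at position i. The data frame toy_linelist, its values, and the column name days_onset_hosp are assumptions made for this illustration; substitute your own data and delay column.

```r
# toy stand-in for linelist (hypothetical values and column names)
toy_linelist <- data.frame(
  hospital        = c("A", "A", "B", "B", "B", "C"),
  days_onset_hosp = c(1, 3, 2, 4, 6, 5)
)

hospital_names <- unique(toy_linelist$hospital)

# container: one numeric slot per hospital
delays <- vector(mode = "double", length = length(hospital_names))

for (i in seq_along(hospital_names)) {
  # logical index of the rows belonging to the current hospital
  hosp_rows <- toy_linelist$hospital == hospital_names[[i]]
  # store the median delay at position i of the container
  delays[[i]] <- median(toy_linelist$days_onset_hosp[hosp_rows], na.rm = TRUE)
}

names(delays) <- hospital_names   # label each result with its hospital
delays
```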

Empty data frame

You can make an empty data frame by specifying the number of rows and columns like this:

delays <- data.frame(matrix(ncol = 2, nrow = 3))
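
A data frame created this way is filled with NA and has default column names (X1, X2). The sketch below, which uses hypothetical column names and values, shows one way to rename the columns and then fill the rows by position inside a loop.

```r
# empty data frame: 3 rows, 2 columns, all NA
delays_df <- data.frame(matrix(ncol = 2, nrow = 3))
default_names <- colnames(delays_df)                    # "X1" "X2"

# hypothetical column names and values, for illustration only
colnames(delays_df) <- c("hospital", "median_delay")
toy_hospitals <- c("A", "B", "C")
toy_medians   <- c(2, 4, 5)

for (i in seq_along(toy_hospitals)) {
  delays_df$hospital[i]     <- toy_hospitals[i]   # fill row i, column 1
  delays_df$median_delay[i] <- toy_medians[i]     # fill row i, column 2
}

delays_df
```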

Empty list

You may want to store some plots created by a for loop in a list. A list is like a vector, but holds other R objects within it that can be of different classes. Items in a list could be a single number, a data frame, a vector, and even another list.

You actually initialize an empty list using the same vector() command as above, but with mode = "list". Specify the length however you wish.

plots <- vector(mode = "list", length = 16)

Printing

Note that to print from within a for loop you will likely need to explicitly wrap with the function print().

In this example below, the sequence is an explicit character vector, which is used to subset the linelist by hospital. The results are not stored in a container, but rather are printed to console with the print() function.

for (hosp in hospital_names){ 
     hospital_cases <- linelist %>% filter(hospital == hosp)
     print(nrow(hospital_cases))
}
## [1] 885
## [1] 1469
## [1] 422
## [1] 1762
## [1] 896
## [1] 454

Testing your for loop

To test your loop, you can run a command to make a temporary assignment of the “item”, such as i <- 10 or hosp <- "Central Hospital". Do this outside the loop and then run your operations code only (the code within the curly brackets) to see if the expected results are produced.
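
A minimal sketch of this testing approach, using a hypothetical toy_linelist and base R subsetting in place of filter(): assign the "item" by hand, then run only the operations code.

```r
# toy stand-in for linelist (hypothetical values)
toy_linelist <- data.frame(
  hospital = c("Central Hospital", "Port Hospital", "Central Hospital")
)

# temporary assignment of the "item", made outside any loop
hosp <- "Central Hospital"

# operations code only (what would sit inside the curly brackets)
hospital_cases <- toy_linelist[toy_linelist$hospital == hosp, , drop = FALSE]
nrow(hospital_cases)
```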

Looping plots

To put all three components together (container, sequence, and operations) let’s try to plot an epicurve for each hospital (see page on Epidemic curves).

We can make a nice epicurve of all the cases by gender using the incidence2 package as below:

# create 'incidence' object
outbreak <- incidence2::incidence(   
     x = linelist,                   # dataframe - complete linelist
     date_index = date_onset,        # date column
     interval = "week",              # aggregate counts weekly
     groups = gender,                # group values by gender
     na_as_group = TRUE)             # missing gender is own group

# plot epi curve
plot(outbreak,                       # name of incidence object
     fill = "gender",                # color bars by gender
     color = "black",                # outline color of bars
     title = "Outbreak of ALL cases" # title
     )

To produce a separate plot for each hospital’s cases, we can put this epicurve code within a for loop.

First, we save a named vector of the unique hospital names, hospital_names. The for loop will run once for each of these names: for (hosp in hospital_names). Each iteration of the for loop, the current hospital name from the vector will be represented as hosp for use within the loop.

Within the loop operations, you can write R code as normal, but use the “item” (hosp in this case) knowing that its value will be changing. Within this loop:

  • A filter() is applied to linelist, such that column hospital must equal the current value of hosp
  • The incidence object is created on the filtered linelist
  • The plot for the current hospital is created, with an auto-adjusting title that uses hosp
  • The plot for the current hospital is temporarily saved and then printed
  • The loop then moves onward to repeat with the next hospital in hospital_names

# make vector of the hospital names
hospital_names <- unique(linelist$hospital)

# for each name ("hosp") in hospital_names, create and print the epi curve
for (hosp in hospital_names) {
     
     # create incidence object specific to the current hospital
     outbreak_hosp <- incidence2::incidence(
          x = linelist %>% filter(hospital == hosp),   # linelist is filtered to the current hospital
          date_index = date_onset,
          interval = "week", 
          groups = gender,
          na_as_group = TRUE
     )
     
     # Create and save the plot. Title automatically adjusts to the current hospital
     plot_hosp <- plot(
       outbreak_hosp,
       fill = "gender",
       color = "black",
       title = stringr::str_glue("Epidemic of cases admitted to {hosp}")
     )
     
     # print the plot for the current hospital
     print(plot_hosp)
     
} # end the for loop when it has been run for every hospital in hospital_names 

Tracking progress of a loop

A loop with many iterations can run for many minutes or even hours. Thus, it can be helpful to print the progress to the R console. The if statement below can be placed within the loop operations to print every 100th number. Just adjust it so that i is the “item” in your loop.

# loop with code to print progress every 100 iterations
for (i in seq_len(nrow(linelist))){

  # print progress
  if(i %% 100==0){    # The %% operator returns the remainder of i divided by 100
    print(i)
  }

}
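
As an alternative sketch, base R (the utils package) also offers a text progress bar via txtProgressBar() and setTxtProgressBar(). The number of iterations and the running total below are hypothetical stand-ins for real loop operations.

```r
n_iter <- 500                                   # hypothetical number of iterations
pb <- txtProgressBar(min = 0, max = n_iter, style = 3)

total <- 0
for (i in seq_len(n_iter)) {
  total <- total + i                            # stand-in for the real operations
  setTxtProgressBar(pb, i)                      # update the bar to iteration i
}
close(pb)                                       # finish the progress bar

total
```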

16.3 purrr and lists

Another approach to iterative operations is the purrr package - it is the tidyverse approach to iteration.

If you are faced with performing the same task several times, it is probably worth creating a generalised solution that you can use across many inputs. For example, producing plots for multiple jurisdictions, or importing and combining many files.

There are also a few other advantages to purrr - you can use it with pipes %>%, it handles errors better than normal for loops, and the syntax is quite clean and simple! If you are using a for loop, you can probably do it more clearly and succinctly with purrr!

Keep in mind that purrr is a functional programming tool. That is, the operations that are to be iteratively applied are wrapped up into functions. See the Writing functions page to learn how to write your own functions.

purrr is also almost entirely based around lists and vectors - so think about it as applying a function to each element of that list/vector!

Load packages

purrr is part of the tidyverse, so there is no need to install/load a separate package.

pacman::p_load(
     rio,            # import/export
     here,           # relative filepaths
     tidyverse,      # data mgmt and viz
     writexl,        # write Excel file with multiple sheets
     readxl          # import Excel with multiple sheets
)

map()

One core purrr function is map(), which “maps” (applies) a function to each input element of a list/vector you provide.

The basic syntax is map(.x = SEQUENCE, .f = FUNCTION, OTHER ARGUMENTS). In a bit more detail:

  • .x = are the inputs upon which the .f function will be iteratively applied - e.g. a vector of jurisdiction names, columns in a data frame, or a list of data frames
  • .f = is the function to apply to each element of the .x input - it could be a function like print() that already exists, or a custom function that you define. The function is often written after a tilde ~ (details below).

A few more notes on syntax:

  • If the function needs no further arguments specified, it can be written with no parentheses and no tilde (e.g. .f = mean). To provide arguments that will be the same value for each iteration, provide them within map() but outside the .f = argument, such as the na.rm = T in map(.x = my_list, .f = mean, na.rm=T).
  • You can use .x (or simply .) within the .f = function as a placeholder for the .x value of that iteration
  • Use tilde syntax (~) to have greater control over the function - write the function as normal with parentheses, such as: map(.x = my_list, .f = ~mean(., na.rm = T)). Use this syntax particularly if the value of an argument will change each iteration, or if it is the value .x itself (see examples below)

The output of using map() is a list - a list is an object class like a vector but whose elements can be of different classes. So, a list produced by map() could contain many data frames, or many vectors, many single values, or even many lists! There are alternative versions of map() explained below that produce other types of outputs (e.g. map_dfr() to produce a data frame, map_chr() to produce character vectors, and map_dbl() to produce numeric vectors).
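
The syntax variations described above can be compared on a small example. This is a sketch that assumes a hypothetical list my_list; the function-name style and the tilde style give the same result, and map_dbl() simplifies that result to a named numeric vector.

```r
library(purrr)

# hypothetical input list, for illustration only
my_list <- list(a = c(1, 2, NA), b = c(10, 20, 30))

# function name only, with a constant extra argument
res_named <- map(.x = my_list, .f = mean, na.rm = TRUE)

# tilde syntax, with .x as the placeholder for the current element
res_formula <- map(.x = my_list, .f = ~mean(.x, na.rm = TRUE))

# same operation, returned as a named numeric vector instead of a list
res_dbl <- map_dbl(.x = my_list, .f = ~mean(.x, na.rm = TRUE))

res_dbl
```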

Example - import and combine Excel sheets

Let’s demonstrate with a common epidemiologist task: you want to import an Excel workbook with case data, but the data are split across different named sheets in the workbook. How do you efficiently import and combine the sheets into one data frame?

Let’s say we are sent the below Excel workbook. Each sheet contains cases from a given hospital.

Here is one approach that uses map():

  1. map() the function import() so that it runs for each Excel sheet
  2. Combine the imported data frames into one using bind_rows()
  3. Along the way, preserve the original sheet name for each row, storing this information in a new column in the final data frame

First, we need to extract the sheet names and save them. We provide the Excel workbook’s file path to the function excel_sheets() from the package readxl, which extracts the sheet names. We store them in a character vector called sheet_names.

sheet_names <- readxl::excel_sheets("hospital_linelists.xlsx")

Here are the names:

sheet_names
## [1] "Central Hospital"              "Military Hospital"             "Missing"                       "Other"                         "Port Hospital"                
## [6] "St. Mark's Maternity Hospital"

Now that we have this vector of names, map() can provide them one-by-one to the function import(). In this example, the sheet_names are .x and import() is the function .f.

Recall from the Import and export page that when used on Excel workbooks, import() can accept the argument which = specifying the sheet to import. Within the .f function import(), we provide which = .x, whose value will change with each iteration through the vector sheet_names - first “Central Hospital”, then “Military Hospital”, etc.

Of note - because we have used map(), the data in each Excel sheet will be saved as a separate data frame within a list. We want each of these list elements (data frames) to have a name, so before we pass sheet_names to map() we pass it through set_names() from purrr, which ensures that each list element gets the appropriate name.

We save the output list as combined.

combined <- sheet_names %>% 
  purrr::set_names() %>% 
  map(.f = ~import("hospital_linelists.xlsx", which = .x))

When we inspect the output, we see that the data from each Excel sheet is saved in the list with a name. This is good, but we are not quite finished.

Lastly, we use the function bind_rows() (from dplyr) which accepts the list of similarly-structured data frames and combines them into one data frame. To create a new column from the list element names, we use the argument .id = and provide it with the desired name for the new column.

Below is the whole sequence of commands:

sheet_names <- readxl::excel_sheets("hospital_linelists.xlsx")  # extract sheet names
 
combined <- sheet_names %>%                                     # begin with sheet names
  purrr::set_names() %>%                                        # set their names
  map(.f = ~import("hospital_linelists.xlsx", which = .x)) %>%  # iterate, import, save in list
  bind_rows(.id = "origin_sheet") # combine list of data frames, preserving origin in new column  

And now we have one data frame with a column containing the sheet of origin!
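
If you do not have the example workbook to hand, the same pattern can be tried on a throwaway file. This sketch writes a small two-sheet workbook to a temporary path and reads it back; the sheet contents are hypothetical, and readxl::read_excel() with sheet = stands in for import() with which = so that only readxl and writexl are needed.

```r
library(purrr)
library(dplyr)

# hypothetical two-sheet workbook written to a temporary file
toy_sheets <- list(
  "Central Hospital" = data.frame(case_id = c("a1", "a2")),
  "Port Hospital"    = data.frame(case_id = "b1")
)
toy_path <- tempfile(fileext = ".xlsx")
writexl::write_xlsx(toy_sheets, path = toy_path)

combined_toy <- readxl::excel_sheets(toy_path) %>%       # extract sheet names
  set_names() %>%                                        # name the vector by its own values
  map(.f = ~readxl::read_excel(toy_path, sheet = .x)) %>% # import each sheet into a list
  bind_rows(.id = "origin_sheet")                        # combine, keeping the sheet name

combined_toy
```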

There are variations of map() that you should be aware of. For example, map_dfr() returns a data frame, not a list. Thus, we could have used it for the task above and not have had to bind rows. To still capture which sheet (hospital) each case came from, we would supply the new column name to its .id = argument, as we did with bind_rows().

Other variations include map_chr() and map_dbl(). These are very useful functions for two reasons. Firstly, they automatically convert the output of an iterative function into a vector (not a list). Secondly, they explicitly control the class that the data comes back in - you ensure that your data comes back as a character vector with map_chr(), or a numeric vector with map_dbl(). Let’s return to these later in the section!

The functions map_at() and map_if() are also very useful for iteration - they allow you to specify which elements of a list you should iterate at! These work by simply applying a vector of indexes/names (in the case of map_at()) or a logical test (in the case of map_if()).

Let’s use an example where we do not want to read the first sheet of hospital data. We use map_at() instead of map(), and set the .at = argument to c(-1), which means to not use the first element of .x. Alternatively, you can provide a vector of positive numbers, or names, to .at = to specify which elements to use.

sheet_names <- readxl::excel_sheets("hospital_linelists.xlsx")

combined <- sheet_names %>% 
     purrr::set_names() %>% 
     # exclude the first sheet
     map_at(.f = ~import( "hospital_linelists.xlsx", which = .x),
            .at = c(-1))

Note that the first sheet name will still appear as an element of the output list - but it is only a single character name (not a data frame). You would need to remove this element before binding rows. We will cover how to remove and modify list elements in a later section.
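
The pass-through behaviour is easier to see on a small list. In this sketch (toy_list and mixed are hypothetical), map_at() changes only the named elements and map_if() changes only the elements that pass the test; everything else is returned untouched.

```r
library(purrr)

# hypothetical list, for illustration only
toy_list <- list(a = 1, b = 2, c = 3)

# apply the function only to the elements named "b" and "c"
by_name <- map_at(toy_list, .at = c("b", "c"), .f = ~.x * 10)

# apply the function only where the logical test is TRUE
mixed   <- list(counts = 1:3, label = "text")
by_test <- map_if(mixed, .p = is.numeric, .f = sum)

by_name
by_test
```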

Split dataset and export

Below, we give an example of how to split a dataset into parts and then use map() iteration to export each part as a separate Excel sheet, or as a separate CSV file.

Split dataset

Let’s say we have the complete case linelist as a data frame, and we now want to create a separate linelist for each hospital and export each as a separate CSV file. Below, we do the following steps:

Use group_split() (from dplyr) to split the linelist data frame by unique values in column hospital. The output is a list containing one data frame per hospital subset.

linelist_split <- linelist %>% 
     group_split(hospital)

We can run View(linelist_split) and see that this list contains 6 data frames (“tibbles”), each representing the cases from one hospital.

However, note that the data frames in the list do not have names by default! We want each to have a name, and then to use that name when saving the CSV file.

One approach to extracting the names is to use pull() (from dplyr) to extract the hospital column from each data frame in the list. Then, to be safe, we convert the values to character and then use unique() to get the name for that particular data frame. All of these steps are applied to each data frame via map().

names(linelist_split) <- linelist_split %>%   # Assign to names of listed data frames 
     # Extract the names by doing the following to each data frame: 
     map(.f = ~pull(.x, hospital)) %>%        # Pull out hospital column
     map(.f = ~as.character(.x)) %>%          # Convert to character, just in case
     map(.f = ~unique(.x))                    # Take the unique hospital name

We can now see that each of the list elements has a name. These names can be accessed via names(linelist_split).

names(linelist_split)
## [1] "Central Hospital"                     "Military Hospital"                    "Missing"                              "Other"                               
## [5] "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)"
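
For comparison, base R's split() returns a list that is already named by the grouping values, which may be a simpler route when you split on a single column. This is a sketch on a hypothetical toy_linelist; note that split() drops rows where the grouping value is NA.

```r
# toy stand-in for linelist (hypothetical values)
toy_linelist <- data.frame(
  case_id  = c("a1", "a2", "b1"),
  hospital = c("Central Hospital", "Central Hospital", "Port Hospital")
)

# split by hospital; list elements are named automatically
toy_split <- split(toy_linelist, toy_linelist$hospital)

names(toy_split)
```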

More than one group_split() column

If you want to split the linelist by more than one grouping column, such as to produce subset linelists by the intersection of hospital AND gender, you will need a different approach to naming the list elements. This involves collecting the unique “group keys” using group_keys() from dplyr - they are returned as a data frame. Then you can combine the group keys into values with unite() as shown below, and assign these conglomerate names to linelist_split.

# split linelist by unique hospital-gender combinations
linelist_split <- linelist %>% 
     group_split(hospital, gender)

# extract group_keys() as a dataframe
groupings <- linelist %>% 
     group_by(hospital, gender) %>%       
     group_keys()

groupings      # show unique groupings 
## # A tibble: 18 x 2
##    hospital                             gender
##    <chr>                                <chr> 
##  1 Central Hospital                     f     
##  2 Central Hospital                     m     
##  3 Central Hospital                     <NA>  
##  4 Military Hospital                    f     
##  5 Military Hospital                    m     
##  6 Military Hospital                    <NA>  
##  7 Missing                              f     
##  8 Missing                              m     
##  9 Missing                              <NA>  
## 10 Other                                f     
## 11 Other                                m     
## 12 Other                                <NA>  
## 13 Port Hospital                        f     
## 14 Port Hospital                        m     
## 15 Port Hospital                        <NA>  
## 16 St. Mark's Maternity Hospital (SMMH) f     
## 17 St. Mark's Maternity Hospital (SMMH) m     
## 18 St. Mark's Maternity Hospital (SMMH) <NA>

Now we combine the groupings together, separated by dashes, and assign them as the names of list elements in linelist_split. This takes some extra lines as we replace NA with “Missing”, use unite() from tidyr to combine the column values together (separated by dashes), and then convert the result into an un-named vector so it can be used as the names of linelist_split.

# Combine into one name value 
names(linelist_split) <- groupings %>% 
     mutate(across(everything(), ~replace_na(.x, "Missing"))) %>%  # replace NA with "Missing" in all columns
     unite("combined", sep = "-") %>%                         # Unite all column values into one
     setNames(NULL) %>% 
     as_vector() %>% 
     as.list()

Export as Excel sheets

To export the hospital linelists as an Excel workbook with one linelist per sheet, we can just provide the named list linelist_split to the write_xlsx() function from the writexl package. This has the ability to save one Excel workbook with multiple sheets. The list element names are automatically applied as the sheet names.

linelist_split %>% 
     writexl::write_xlsx(path = here("data", "hospital_linelists.xlsx"))

You can now open the Excel file and see that each hospital has its own sheet.

Export as CSV files

It is a bit more complex command, but you can also export each hospital-specific linelist as a separate CSV file, with a file name specific to the hospital.

Again we use map(): we take the vector of list element names (shown above) and use map() to iterate through them, applying export() (from the rio package, see Import and export page) on the data frame in the list linelist_split that has that name. We also use the name to create a unique file name. Here is how it works:

  • We begin with the vector of character names, passed to map() as .x
  • The .f function is export() , which requires a data frame and a file path to write to
  • The input .x (the hospital name) is used within .f to extract/index that specific element of linelist_split list. This results in only one data frame at a time being provided to export().
  • For example, when map() iterates for “Military Hospital”, then linelist_split[[.x]] is actually linelist_split[["Military Hospital"]], thus returning the second element of linelist_split - which is all the cases from Military Hospital.
  • The file path provided to export() is dynamic via use of str_glue() (see Characters and strings page):
    • here() is used to get the base of the file path and specify the “data” folder (note single quotes to not interrupt the str_glue() double quotes)
    • Then a slash /, and then again the .x which prints the current hospital name to make the file identifiable
    • Finally the extension “.csv” which export() uses to create a CSV file

names(linelist_split) %>%
     map(.f = ~export(linelist_split[[.x]], file = str_glue("{here('data')}/{.x}.csv")))

Now you can see that each file is saved in the “data” folder of the R Project “Epi_R_handbook”!
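
Because exporting is done for its side effect (writing files) rather than for a returned value, purrr's walk() family is another option. The sketch below uses iwalk(), which supplies each list element as .x and its name as .y. The list toy_split and the temporary output folder are hypothetical, and base write.csv() stands in for export() to keep the example self-contained.

```r
library(purrr)

# hypothetical named list of data frames, standing in for linelist_split
toy_split <- list(
  "Central Hospital" = data.frame(case_id = c("a1", "a2")),
  "Port Hospital"    = data.frame(case_id = "b1")
)

# temporary output folder, for illustration only
out_dir <- file.path(tempdir(), "toy_exports")
dir.create(out_dir, showWarnings = FALSE)

# .x is the data frame, .y is its name in the list
iwalk(toy_split,
      ~write.csv(.x, file = file.path(out_dir, paste0(.y, ".csv")), row.names = FALSE))

list.files(out_dir)
```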

Custom functions

You may want to create your own function to provide to map().

Let’s say we want to create epidemic curves for each hospital’s cases. To do this using purrr, our .f function can be ggplot() and extensions with + as usual. As the output of map() is always a list, the plots are stored in a list. Because they are plots, they can be extracted and plotted with the ggarrange() function from the ggpubr package (documentation).

# load package for plotting elements from list
pacman::p_load(ggpubr)

# map across the vector of 6 hospital "names" (created earlier)
# use the ggplot function specified
# output is a list with 6 ggplots

hospital_names <- unique(linelist$hospital)

my_plots <- map(
  .x = hospital_names,
  .f = ~ggplot(data = linelist %>% filter(hospital == .x)) +
                geom_histogram(aes(x = date_onset)) +
                labs(title = .x)
)

# print the ggplots (they are stored in a list)
ggarrange(plotlist = my_plots, ncol = 2, nrow = 3)

If this map() code looks too messy, you can achieve the same result by saving your specific ggplot() command as a custom user-defined function, for example we can name it make_epicurve(). This function is then used within map(). .x will be iteratively replaced by the hospital name, and used as hosp_name in the make_epicurve() function. See the page on Writing functions.

# Create function
make_epicurve <- function(hosp_name){
  
  ggplot(data = linelist %>% filter(hospital == hosp_name)) +
    geom_histogram(aes(x = date_onset)) +
    theme_classic()+
    labs(title = hosp_name)
  
}
# mapping
my_plots <- map(hospital_names, ~make_epicurve(hosp_name = .x))

# print the ggplots (they are stored in a list)
ggarrange(plotlist = my_plots, ncol = 2, nrow = 3)
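
One optional refinement, sketched here on a hypothetical toy_linelist, is to pass the hospital names through set_names() first so that each plot in the output list carries its hospital's name and can be retrieved by name. The bins = 5 setting is an arbitrary choice for this toy data.

```r
library(purrr)
library(ggplot2)

# toy stand-in for linelist (hypothetical values)
toy_linelist <- data.frame(
  hospital   = c("A", "A", "B"),
  date_onset = as.Date(c("2014-05-01", "2014-05-08", "2014-06-01"))
)

# same idea as make_epicurve() above, applied to the toy data
make_epicurve <- function(hosp_name){
  ggplot(data = toy_linelist[toy_linelist$hospital == hosp_name, ]) +
    geom_histogram(aes(x = date_onset), bins = 5) +
    labs(title = hosp_name)
}

my_plots <- unique(toy_linelist$hospital) %>%
  set_names() %>%          # names carry through to the output list
  map(make_epicurve)

names(my_plots)            # plots can now be accessed as my_plots[["A"]]
```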

Mapping a function across columns

Another common use-case is to map a function across many columns. Below, we map() the function t.test() across numeric columns in the data frame linelist, comparing the numeric values by gender.

Recall from the page on Simple statistical tests that t.test() can take inputs in a formula format, such as t.test(numeric column ~ binary column). In this example, we do the following:

  • The numeric columns of interest are selected from linelist - these become the .x inputs to map()
  • The function t.test() is supplied as the .f function, which is applied to each numeric column
  • Within the parentheses of t.test():
    • the first ~ precedes the .f that map() will iterate over .x
    • the .x represents the current column being supplied to the function t.test()
    • the second ~ is part of the t-test equation described above
    • the t.test() function expects a binary column on the right-hand side of the equation. We supply the vector linelist$gender independently and statically (note that it is not included in select()).

map() returns a list, so the output is a list of t-test results - one list element for each numeric column analysed.

# Results are saved as a list
t.test_results <- linelist %>% 
  select(age, wt_kg, ht_cm, ct_blood, temp) %>%  # keep only some numeric columns to map across
  map(.f = ~t.test(.x ~ linelist$gender))        # t.test function, with equation NUMERIC ~ CATEGORICAL

Here is what the list t.test_results looks like when opened (Viewed) in RStudio. We have highlighted parts that are important for the examples in this page.

  • You can see at the top that the whole list is named t.test_results and has five elements. Those five elements are named age, wt_kg, ht_cm, ct_blood, temp after each variable that was used in a t-test with gender from the linelist.
  • Each of those five elements is itself a list, with elements within it such as p.value and conf.int. Some of these elements like p.value are single numbers, whereas some such as estimate consist of two or more elements (mean in group f and mean in group m).
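
To explore this structure without the handbook data, the same pattern can be run on R's built-in mtcars data set. This is only a sketch: the columns mpg, wt and hp and the binary column am are chosen arbitrarily as stand-ins for the linelist columns.

```r
library(purrr)

# t-test of three numeric columns against the binary column am
toy_results <- mtcars[, c("mpg", "wt", "hp")] %>%
  map(.f = ~t.test(.x ~ mtcars$am))

names(toy_results)        # one list element per column tested
class(toy_results$mpg)    # each element is a test result object
names(toy_results$mpg)    # the inner elements, including p.value and estimate
```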

Note: Remember that if you want to apply a function to only certain columns in a data frame, you can also simply use mutate() and across(), as explained in the Cleaning data and core functions page. Below is an example of applying as.character() to only the “age” columns. Note the placement of the parentheses and commas.

# convert columns with column name containing "age" to class Character
linelist <- linelist %>% 
  mutate(across(.cols = contains("age"), .fns = as.character))  

Extract from lists

As map() produces an output of class List, we will spend some time discussing how to extract data from lists using accompanying purrr functions. To demonstrate this, we will use the list t.test_results from the previous section. This is a list of 5 lists - each of the 5 lists contains the results of a t-test between a column from linelist data frame and its binary column gender. See the image in the section above for a visual of the list structure.

Names of elements

To extract the names of the elements themselves, simply use names() from base R. In this case, we use names() on t.test_results to return the names of each sub-list, which are the names of the 5 variables that had t-tests performed.

names(t.test_results)
## [1] "age"      "wt_kg"    "ht_cm"    "ct_blood" "temp"

Elements by name or position

To extract list elements by name or by position you can use brackets [[ ]] as described in the R basics page. Below we use double brackets to index the list t.test_results and display the first element, which is the results of the t-test on age.

t.test_results[[1]] # first element by position
## 
##  Welch Two Sample t-test
## 
## data:  .x by linelist$gender
## t = -21.3, df = 4902.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -7.544409 -6.272675
## sample estimates:
## mean in group f mean in group m 
##        12.66085        19.56939
t.test_results[[1]]["p.value"] # return element named "p.value" from first element  
## $p.value
## [1] 2.350374e-96

However, below we will demonstrate use of the simple and flexible purrr functions map() and pluck() to achieve the same outcomes.

pluck()

pluck() pulls out elements by name or by position. For example - to extract the t-test results for age, you can use pluck() like this:

t.test_results %>% 
  pluck("age")        # alternatively, use pluck(1)
## 
##  Welch Two Sample t-test
## 
## data:  .x by linelist$gender
## t = -21.3, df = 4902.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -7.544409 -6.272675
## sample estimates:
## mean in group f mean in group m 
##        12.66085        19.56939

Index deeper levels by specifying the further levels with commas. The below extracts the element named “p.value” from the list age within the list t.test_results. You can also use numbers instead of character names.

t.test_results %>% 
  pluck("age", "p.value")
## [1] 2.350374e-96

You can extract such inner elements from all first-level elements by using map() to run the pluck() function across each first-level element. For example, the below code extracts the “p.value” elements from all lists within t.test_results. The list of t-test results is the .x iterated across, pluck() is the .f function being iterated, and the value “p.value” is provided to the function.

t.test_results %>%
  map(pluck, "p.value")   # return every p-value
## $age
## [1] 2.350374e-96
## 
## $wt_kg
## [1] 2.664367e-182
## 
## $ht_cm
## [1] 3.515713e-144
## 
## $ct_blood
## [1] 0.4473498
## 
## $temp
## [1] 0.5735923

As another alternative, map() offers a shorthand where you can write the element name in quotes, and it will pluck it out. If you use map() the output will be a list, whereas if you use map_chr() it will be a named character vector and if you use map_dbl() it will be a named numeric vector.

t.test_results %>% 
  map_dbl("p.value")   # return p-values as a named numeric vector
##           age         wt_kg         ht_cm      ct_blood          temp 
##  2.350374e-96 2.664367e-182 3.515713e-144  4.473498e-01  5.735923e-01

You can read more about pluck() in its purrr documentation. It has a sibling function chuck() that will return an error instead of NULL if an element does not exist.
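
The difference between the two is sketched below on a hypothetical nested list (toy and its values are made up for illustration). pluck() quietly returns NULL, or a value you supply to .default =, whereas chuck() stops with an error, which is captured here with try().

```r
library(purrr)

# hypothetical nested list, for illustration only
toy <- list(age = list(p.value = 0.01))

found        <- pluck(toy, "age", "p.value")        # element exists
missing_null <- pluck(toy, "weight")                # no such element: NULL
missing_na   <- pluck(toy, "weight", .default = NA) # no such element: NA instead

# chuck() raises an error for a missing element; try() captures it
chuck_attempt <- try(chuck(toy, "weight"), silent = TRUE)
class(chuck_attempt)
```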

Convert list to data frame

This is a complex topic - see the Resources section for more complete tutorials. Nevertheless, we will demonstrate converting the list of t-test results into a data frame. We will create a data frame with columns for the variable, its p-value, and the means from the two groups (male and female).

Here are some of the new approaches and functions that will be used:

  • The function tibble() will be used to create a tibble (like a data frame)
    • We surround the tibble() function with curly brackets { } to prevent the entire t.test_results from being stored as the first tibble column
  • Within tibble(), each column is created explicitly, similar to the syntax of mutate():
    • The . represents t.test_results
    • To create a column with the t-test variable names (the names of each list element) we use names() as described above
    • To create a column with the p-values we use map_dbl() as described above to pull the p.value elements and convert them to a numeric vector

t.test_results %>% {
  tibble(
    variables = names(.),
    p         = map_dbl(., "p.value"))
  }
## # A tibble: 5 x 2
##   variables         p
##   <chr>         <dbl>
## 1 age       2.35e- 96
## 2 wt_kg     2.66e-182
## 3 ht_cm     3.52e-144
## 4 ct_blood  4.47e-  1
## 5 temp      5.74e-  1

But now let’s add columns containing the means for each group (males and females).

We would need to extract the element estimate, but this actually contains two elements within it (mean in group f and mean in group m). So, it cannot be simplified into a vector with map_chr() or map_dbl(). Instead, we use map(), which used within tibble() will create a column of class list within the tibble! Yes, this is possible!

t.test_results %>% 
  {tibble(
    variables = names(.),
    p = map_dbl(., "p.value"),
    means = map(., "estimate"))}
## # A tibble: 5 x 3
##   variables         p means       
##   <chr>         <dbl> <named list>
## 1 age       2.35e- 96 <dbl [2]>   
## 2 wt_kg     2.66e-182 <dbl [2]>   
## 3 ht_cm     3.52e-144 <dbl [2]>   
## 4 ct_blood  4.47e-  1 <dbl [2]>   
## 5 temp      5.74e-  1 <dbl [2]>

Once you have this list column, there are several tidyr functions (part of tidyverse) that help you “rectangle” or “un-nest” these “nested list” columns. Read more about them here, or by running vignette("rectangle"). In brief:

  • unnest_wider() - gives each element of a list-column its own column
  • unnest_longer() - gives each element of a list-column its own row
  • hoist() - acts like unnest_wider() but you specify which elements to unnest
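To make the difference between the first two functions concrete, here is a minimal sketch on a small hypothetical tibble (the object toy_means and its values are invented for illustration and are not part of the linelist). It assumes the tibble and tidyr packages are installed.

```r
library(tibble)
library(tidyr)

# Hypothetical tibble with a list-column of named numeric vectors
toy_means <- tibble(
  variables = c("age", "wt_kg"),
  means = list(
    c(f = 12.7, m = 19.6),
    c(f = 45.8, m = 59.6)))

# One new column per element: columns f and m replace means
toy_wide <- unnest_wider(toy_means, means)

# One new row per element: means holds the values, means_id holds the element names
toy_long <- unnest_longer(toy_means, means)

toy_wide
toy_long
```

The wide form keeps one row per variable, while the long form repeats each variable once per group.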

Below, we pass the tibble to unnest_wider(), specifying the tibble’s means column (which is a nested list). The result is that means is replaced by two new columns, one for each of the two elements that were previously in each means cell.

t.test_results %>% 
  {tibble(
    variables = names(.),
    p = map_dbl(., "p.value"),
    means = map(., "estimate")
    )} %>% 
  unnest_wider(means)
## # A tibble: 5 x 4
##   variables         p `mean in group f` `mean in group m`
##   <chr>         <dbl>             <dbl>             <dbl>
## 1 age       2.35e- 96              12.7              19.6
## 2 wt_kg     2.66e-182              45.8              59.6
## 3 ht_cm     3.52e-144             109.              142. 
## 4 ct_blood  4.47e-  1              21.2              21.2
## 5 temp      5.74e-  1              38.6              38.6

Discard, keep, and compact lists

Because working with purrr so often involves lists, we will briefly explore some purrr functions to modify lists. See the Resources section for more complete tutorials on purrr functions.

  • list_modify() has many uses, one of which can be to remove a list element
  • keep() retains the elements specified to .p =, or where a function supplied to .p = evaluates to TRUE
  • discard() removes the elements specified to .p, or where a function supplied to .p = evaluates to TRUE
  • compact() removes all empty elements

Here are some examples using the combined list created in the section above on using map() to import and combine multiple files (it contains 6 case linelist data frames):

Elements can be removed by name with list_modify() and setting the name equal to NULL.

combined %>% 
  list_modify("Central Hospital" = NULL)   # remove list element by name

You can also remove elements by criteria, by providing a “predicate” equation to .p = (an equation that evaluates to either TRUE or FALSE). Place a tilde ~ before the function and use .x to represent the list element. Using keep() the list elements that evaluate to TRUE will be kept. Inversely, if using discard() the list elements that evaluate to TRUE will be removed.

# keep only list elements with more than 500 rows
combined %>% 
  keep(.p = ~nrow(.x) > 500)  

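In the example below, list elements are discarded if they are not data frames. The predicate uses is.data.frame(), which returns a single TRUE or FALSE even for objects with several classes (such as tibbles).

# Discard list elements that are not data frames
combined %>% 
  discard(.p = ~!is.data.frame(.x))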

Your predicate function can also reference elements/columns within each list item. For example, below, list elements where the mean of column ct_blood is over 25 are discarded.

# discard list elements where ct_blood column mean is over 25
combined %>% 
  discard(.p = ~mean(.x$ct_blood) > 25)  

This command would remove all empty list elements:

# Remove all empty list elements
combined %>% 
  compact()
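Because the combined list is only available if you have imported the example files, here is a self-contained sketch of the same three functions on a small hypothetical list (toy_list and its element names are invented for illustration). It assumes the purrr package is installed.

```r
library(purrr)

# Hypothetical list mixing data frames, an empty element, and a stray string
toy_list <- list(
  site_a = data.frame(ct_blood = c(20, 22, 24)),   # mean is 22
  site_b = data.frame(ct_blood = c(26, 28)),       # mean is 27
  site_c = NULL,                                   # empty element
  site_d = "not a data frame")

cleaned <- toy_list %>% 
  compact() %>%                             # drops site_c (empty)
  keep(.p = is.data.frame) %>%              # drops site_d (not a data frame)
  discard(.p = ~mean(.x$ct_blood) > 25)     # drops site_b (mean over 25)

names(cleaned)
```

Here compact() is applied first so that the later predicates never encounter the empty element.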

pmap()

THIS SECTION IS UNDER CONSTRUCTION

16.4 Apply functions

The “apply” family of functions is a base R alternative to purrr for iterative operations. You can read more about them here.
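As a rough orientation only, the sketch below shows three base R functions that play roles similar to map() and map_dbl(). The list toy_list is hypothetical, and the comparison to purrr is an approximation rather than an exact equivalence.

```r
# Hypothetical named list of numeric vectors
toy_list <- list(a = c(1, 2, 3), b = c(10, 20))

# lapply() always returns a list, much like map()
means_list <- lapply(toy_list, mean)

# sapply() simplifies the result to a vector when it can
means_vec <- sapply(toy_list, mean)

# vapply() requires you to declare the type and length of each result, much like map_dbl()
means_safe <- vapply(toy_list, mean, FUN.VALUE = numeric(1))

means_vec
```

vapply() is often preferred in scripts because it fails loudly if a result is not of the declared type and length.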

Analysis

17 Descriptive tables

This page demonstrates the use of janitor, dplyr, gtsummary, rstatix, and base R to summarise data and create tables with descriptive statistics.

This page covers how to create the underlying tables, whereas the Tables for presentation page covers how to nicely format and print them.

Each of these packages has advantages and disadvantages in the areas of code simplicity, accessibility of outputs, and quality of printed outputs. Use this page to decide which approach works for your scenario.

You have several choices when producing tabulation and cross-tabulation summary tables. Some of the factors to consider include code simplicity, customizability, the desired output (printed to R console, as data frame, or as “pretty” .png/.jpeg/.html image), and ease of post-processing. Consider the points below as you choose the tool for your situation.

  • Use tabyl() from janitor to produce and “adorn” tabulations and cross-tabulations
  • Use get_summary_stats() from rstatix to easily generate data frames of numeric summary statistics for multiple columns and/or groups
  • Use summarise() and count() from dplyr for more complex statistics, tidy data frame outputs, or preparing data for ggplot()
  • Use tbl_summary() from gtsummary to produce detailed publication-ready tables
  • Use table() from base R if you do not have access to the above packages

17.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,          # File import
  here,         # File locator
  skimr,        # get overview of data
  tidyverse,    # data management + ggplot2 graphics 
  gtsummary,    # summary statistics and tests
  rstatix,      # summary statistics and statistical tests
  janitor,      # adding totals and percents to tables
  scales,       # easily convert proportions to percents  
  flextable     # converting tables to pretty images
  )

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

17.2 Browse data

skimr package

By using the skimr package, you can get a detailed and aesthetically pleasing overview of each of the variables in your dataset. Read more about skimr at its github page.

Below, the function skim() is applied to the entire linelist data frame. An overview of the data frame and a summary of every column (by class) is produced.

## get information about each variable in a dataset 
skim(linelist)
Data summary
Name linelist
Number of rows 5888
Number of columns 30
_______________________
Column type frequency:
character 13
Date 4
factor 2
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
case_id 0 1.00 6 6 0 5888 0
outcome 1323 0.78 5 7 0 2 0
gender 278 0.95 1 1 0 2 0
age_unit 0 1.00 5 6 0 2 0
hospital 0 1.00 5 36 0 6 0
infector 2088 0.65 6 6 0 2697 0
source 2088 0.65 5 7 0 2 0
fever 249 0.96 2 3 0 2 0
chills 249 0.96 2 3 0 2 0
cough 249 0.96 2 3 0 2 0
aches 249 0.96 2 3 0 2 0
vomit 249 0.96 2 3 0 2 0
time_admission 765 0.87 5 5 0 1072 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_infection 2087 0.65 2014-03-19 2015-04-27 2014-10-11 359
date_onset 256 0.96 2014-04-07 2015-04-30 2014-10-23 367
date_hospitalisation 0 1.00 2014-04-17 2015-04-30 2014-10-23 363
date_outcome 936 0.84 2014-04-19 2015-06-04 2014-11-01 371

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
age_cat 86 0.99 FALSE 8 0-4: 1095, 5-9: 1095, 20-: 1073, 10-: 941
age_cat5 86 0.99 FALSE 17 0-4: 1095, 5-9: 1095, 10-: 941, 15-: 743

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
generation 0 1.00 16.56 5.79 0.00 13.00 16.00 20.00 37.00
age 86 0.99 16.07 12.62 0.00 6.00 13.00 23.00 84.00
age_years 86 0.99 16.02 12.64 0.00 6.00 13.00 23.00 84.00
lon 0 1.00 -13.23 0.02 -13.27 -13.25 -13.23 -13.22 -13.21
lat 0 1.00 8.47 0.01 8.45 8.46 8.47 8.48 8.49
wt_kg 0 1.00 52.64 18.58 -11.00 41.00 54.00 66.00 111.00
ht_cm 0 1.00 124.96 49.52 4.00 91.00 129.00 159.00 295.00
ct_blood 0 1.00 21.21 1.69 16.00 20.00 22.00 22.00 26.00
temp 149 0.97 38.56 0.98 35.20 38.20 38.80 39.20 40.80
bmi 0 1.00 46.89 55.39 -1200.00 24.56 32.12 50.01 1250.00
days_onset_hosp 256 0.96 2.06 2.26 0.00 1.00 1.00 3.00 22.00

You can also use the summary() function, from base R, to get information about an entire dataset, but this output can be more difficult to read than using skimr. Therefore the output is not shown below, to conserve page space.

## get information about each column in a dataset 
summary(linelist)

Summary statistics

You can use base R functions to return summary statistics on a numeric column. You can return most of the useful summary statistics for a numeric column using summary(), as below. Note that the data frame name must also be specified as shown below.

summary(linelist$age_years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.00   13.00   16.02   23.00   84.00      86

You can access and save one specific part of it with index brackets [ ]:

summary(linelist$age_years)[[2]]            # return only the 2nd element
## [1] 6
# equivalent, alternative to above by element name
# summary(linelist$age_years)[["1st Qu."]]  

You can return individual statistics with base R functions like max(), min(), median(), mean(), quantile(), sd(), and range(). See the R basics page for a complete list.

CAUTION: If your data contain missing values, R wants you to know this and so will return NA unless you specify to the above mathematical functions that you want R to ignore missing values, via the argument na.rm = TRUE.
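A minimal illustration of this behaviour, using a short hypothetical vector of ages (not taken from the linelist):

```r
# Hypothetical ages with one missing value
ages <- c(4, 12, NA, 30)

mean_default <- mean(ages)                 # returns NA because of the missing value
mean_no_na   <- mean(ages, na.rm = TRUE)   # mean of the three non-missing values
max_no_na    <- max(ages, na.rm = TRUE)    # largest non-missing value

mean_default
mean_no_na
max_no_na
```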

You can use the get_summary_stats() function from rstatix to return summary statistics in a data frame format. This can be helpful for performing subsequent operations on the numbers or for plotting them. See the Simple statistical tests page for more details on the rstatix package and its functions.

linelist %>% 
  get_summary_stats(
    age, wt_kg, ht_cm, ct_blood, temp,  # columns to calculate for
    type = "common")                    # summary stats to return
## # A tibble: 5 x 10
##   variable     n   min   max median   iqr  mean     sd    se    ci
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 age       5802   0    84     13      17  16.1 12.6   0.166 0.325
## 2 ct_blood  5888  16    26     22       2  21.2  1.69  0.022 0.043
## 3 ht_cm     5888   4   295    129      68 125.  49.5   0.645 1.26 
## 4 temp      5739  35.2  40.8   38.8     1  38.6  0.977 0.013 0.025
## 5 wt_kg     5888 -11   111     54      25  52.6 18.6   0.242 0.475

17.3 janitor package

The janitor package offers the tabyl() function to produce tabulations and cross-tabulations, which can be “adorned” or modified with helper functions to display percents, proportions, counts, etc.

Below, we pipe the linelist data frame to janitor functions and print the result. If desired, you can also save the resulting tables with the assignment operator <-.

Simple tabyl

The default use of tabyl() on a specific column produces the unique values, counts, and column-wise “percents” (actually proportions). The proportions may have many digits. You can adjust the number of decimals with adorn_rounding() as described below.

linelist %>% tabyl(age_cat)
##  age_cat    n     percent valid_percent
##      0-4 1095 0.185971467   0.188728025
##      5-9 1095 0.185971467   0.188728025
##    10-14  941 0.159816576   0.162185453
##    15-19  743 0.126188859   0.128059290
##    20-29 1073 0.182235054   0.184936229
##    30-49  754 0.128057065   0.129955188
##    50-69   95 0.016134511   0.016373664
##      70+    6 0.001019022   0.001034126
##     <NA>   86 0.014605978            NA

As you can see above, if there are missing values they display in a row labeled <NA>. You can suppress them with show_na = FALSE. If there are no missing values, this row will not appear. If there are missing values, all proportions are given as both raw (denominator inclusive of NA counts) and “valid” (denominator excludes NA counts).

If the column is class factor and only certain levels are present in your data, all levels will still appear in the table. You can suppress this feature by specifying show_missing_levels = FALSE. Read more on the Factors page.
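The two arguments can be compared side by side on a small hypothetical data frame (toy is invented for illustration; its factor has an unused level "70+" and one missing value). This sketch assumes the janitor package is installed.

```r
library(janitor)

# Hypothetical data: factor with an unused level ("70+") and one NA
toy <- data.frame(
  age_cat = factor(
    c("0-4", "0-4", "5-9", NA),
    levels = c("0-4", "5-9", "70+")))

tab_default    <- tabyl(toy, age_cat)                               # shows the 70+ row (n = 0) and the <NA> row
tab_no_na      <- tabyl(toy, age_cat, show_na = FALSE)              # drops the <NA> row
tab_no_missing <- tabyl(toy, age_cat, show_missing_levels = FALSE)  # drops the unused 70+ level

tab_default
tab_no_na
tab_no_missing
```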

Cross-tabulation

Cross-tabulation counts are achieved by adding one or more additional columns within tabyl(). Note that now only counts are returned - proportions and percents can be added with additional steps shown below.

linelist %>% tabyl(age_cat, gender)
##  age_cat   f   m NA_
##      0-4 640 416  39
##      5-9 641 412  42
##    10-14 518 383  40
##    15-19 359 364  20
##    20-29 468 575  30
##    30-49 179 557  18
##    50-69   2  91   2
##      70+   0   5   1
##     <NA>   0   0  86

“Adorning” the tabyl

Use janitor’s “adorn” functions to add totals or convert to proportions, percents, or otherwise adjust the display. Often, you will pipe the tabyl through several of these functions.

Function                  Outcome
adorn_totals()            Adds totals (where = "row", "col", or "both"). Use name = to set the label (default "Total").
adorn_percentages()       Converts counts to proportions, with denominator = "row", "col", or "all".
adorn_pct_formatting()    Converts proportions to percents. Specify digits =. Remove the "%" symbol with affix_sign = FALSE.
adorn_rounding()          Rounds proportions to digits = places. To round percents, use adorn_pct_formatting() with digits =.
adorn_ns()                Adds counts to a table of proportions or percents. Indicate position = "rear" to show counts in parentheses, or "front" to put the percents in parentheses.
adorn_title()             Adds a title string via the arguments row_name = and/or col_name =.

Be conscious of the order you apply the above functions. Below are some examples.

A simple one-way table with percents instead of the default proportions.

linelist %>%               # case linelist
  tabyl(age_cat) %>%       # tabulate counts and proportions by age category
  adorn_pct_formatting()   # convert proportions to percents
##  age_cat    n percent valid_percent
##      0-4 1095   18.6%         18.9%
##      5-9 1095   18.6%         18.9%
##    10-14  941   16.0%         16.2%
##    15-19  743   12.6%         12.8%
##    20-29 1073   18.2%         18.5%
##    30-49  754   12.8%         13.0%
##    50-69   95    1.6%          1.6%
##      70+    6    0.1%          0.1%
##     <NA>   86    1.5%             -

A cross-tabulation with a total row and row percents.

linelist %>%                                  
  tabyl(age_cat, gender) %>%                  # counts by age and gender
  adorn_totals(where = "row") %>%             # add total row
  adorn_percentages(denominator = "row") %>%  # convert counts to proportions
  adorn_pct_formatting(digits = 1)            # convert proportions to percents
##  age_cat     f     m    NA_
##      0-4 58.4% 38.0%   3.6%
##      5-9 58.5% 37.6%   3.8%
##    10-14 55.0% 40.7%   4.3%
##    15-19 48.3% 49.0%   2.7%
##    20-29 43.6% 53.6%   2.8%
##    30-49 23.7% 73.9%   2.4%
##    50-69  2.1% 95.8%   2.1%
##      70+  0.0% 83.3%  16.7%
##     <NA>  0.0%  0.0% 100.0%
##    Total 47.7% 47.6%   4.7%

A cross-tabulation adjusted so that both counts and percents are displayed.

linelist %>%                                  # case linelist
  tabyl(age_cat, gender) %>%                  # cross-tabulate counts
  adorn_totals(where = "row") %>%             # add a total row
  adorn_percentages(denominator = "col") %>%  # convert to proportions
  adorn_pct_formatting() %>%                  # convert to percents
  adorn_ns(position = "front") %>%            # display as: "count (percent)"
  adorn_title(                                # adjust titles
    row_name = "Age Category",
    col_name = "Gender")
##                      Gender                           
##  Age Category             f             m          NA_
##           0-4  640  (22.8%)  416  (14.8%)  39  (14.0%)
##           5-9  641  (22.8%)  412  (14.7%)  42  (15.1%)
##         10-14  518  (18.5%)  383  (13.7%)  40  (14.4%)
##         15-19  359  (12.8%)  364  (13.0%)  20   (7.2%)
##         20-29  468  (16.7%)  575  (20.5%)  30  (10.8%)
##         30-49  179   (6.4%)  557  (19.9%)  18   (6.5%)
##         50-69    2   (0.1%)   91   (3.2%)   2   (0.7%)
##           70+    0   (0.0%)    5   (0.2%)   1   (0.4%)
##          <NA>    0   (0.0%)    0   (0.0%)  86  (30.9%)
##         Total 2807 (100.0%) 2803 (100.0%) 278 (100.0%)

Printing the tabyl

By default, the tabyl will print raw to your R console.

Alternatively, you can pass the tabyl to flextable or a similar package to print as a “pretty” image in the RStudio Viewer, which could be exported as .png, .jpeg, .html, etc. This is discussed in the page Tables for presentation. Note that if printing in this manner and using adorn_title(), you must specify placement = "combined".

linelist %>%
  tabyl(age_cat, gender) %>% 
  adorn_totals(where = "col") %>% 
  adorn_percentages(denominator = "col") %>% 
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front") %>% 
  adorn_title(
    row_name = "Age Category",
    col_name = "Gender",
    placement = "combined") %>% # this is necessary to print as image
  flextable::flextable() %>%    # convert to pretty image
  flextable::autofit()          # format to one line per row 

Use on other tables

You can use janitor’s adorn_*() functions on other tables, such as those created by summarise() and count() from dplyr, or table() from base R. Simply pipe the table to the desired janitor function. For example:

linelist %>% 
  count(hospital) %>%   # dplyr function
  adorn_totals()        # janitor function
##                              hospital    n
##                      Central Hospital  454
##                     Military Hospital  896
##                               Missing 1469
##                                 Other  885
##                         Port Hospital 1762
##  St. Mark's Maternity Hospital (SMMH)  422
##                                 Total 5888

Saving the tabyl

If you convert the table to a “pretty” image with a package like flextable, you can save it with functions from that package - like save_as_html(), save_as_docx(), save_as_pptx(), and save_as_image() from flextable (as discussed more extensively in the Tables for presentation page). Below, the table is saved as a Word document, in which it can be further hand-edited.

linelist %>%
  tabyl(age_cat, gender) %>% 
  adorn_totals(where = "col") %>% 
  adorn_percentages(denominator = "col") %>% 
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front") %>% 
  adorn_title(
    row_name = "Age Category",
    col_name = "Gender",
    placement = "combined") %>% 
  flextable::flextable() %>%                     # convert to image
  flextable::autofit() %>%                       # ensure only one line per row
  flextable::save_as_docx(path = "tabyl.docx")   # save as Word document to filepath

Statistics

You can apply statistical tests on tabyls, like chisq.test() or fisher.test() from the stats package, as shown below. Note missing values are not allowed so they are excluded from the tabyl with show_na = FALSE.

age_by_outcome <- linelist %>% 
  tabyl(age_cat, outcome, show_na = FALSE) 

chisq.test(age_by_outcome)
## 
##  Pearson's Chi-squared test
## 
## data:  age_by_outcome
## X-squared = 6.4931, df = 7, p-value = 0.4835

See the page on Simple statistical tests for more code and tips about statistics.

Other tips

  • Include the argument na.rm = TRUE to exclude missing values from any of the above calculations.
  • If applying any adorn_*() helper functions to tables not created by tabyl(), you can specify particular column(s) to apply them to, like adorn_percentages(,,,c(cases,deaths)) (specify them to the 4th unnamed argument). The syntax is not simple. Consider using summarise() instead.
  • You can read more detail in the janitor page and this tabyl vignette.

17.4 dplyr package

dplyr is part of the tidyverse packages and is a very common data management tool. Creating tables with the dplyr functions summarise() and count() is a useful approach to calculating summary statistics, summarising by group, or passing tables to ggplot().

summarise() creates a new, summary data frame. If the data are ungrouped, it will return a one-row dataframe with the specified summary statistics of the entire data frame. If the data are grouped, the new data frame will have one row per group (see Grouping data page).

Within the summarise() parentheses, you provide the names of each new summary column followed by an equals sign and a statistical function to apply.

TIP: The summarise function works with both UK and US spelling (summarise() and summarize()).

Get counts

The simplest function to apply within summarise() is n(). Leave the parentheses empty to count the number of rows.

linelist %>%                 # begin with linelist
  summarise(n_rows = n())    # return new summary dataframe with column n_rows
##   n_rows
## 1   5888

This gets more interesting if we have grouped the data beforehand.

linelist %>% 
  group_by(age_cat) %>%     # group data by unique values in column age_cat
  summarise(n_rows = n())   # return number of rows *per group*
## # A tibble: 9 x 2
##   age_cat n_rows
##   <fct>    <int>
## 1 0-4       1095
## 2 5-9       1095
## 3 10-14      941
## 4 15-19      743
## 5 20-29     1073
## 6 30-49      754
## 7 50-69       95
## 8 70+          6
## 9 <NA>        86

The above command can be shortened by using the count() function instead. count() does the following:

  1. Groups the data by the columns provided to it
  2. Summarises them with n() (creating column n)
  3. Un-groups the data
linelist %>% 
  count(age_cat)
##   age_cat    n
## 1     0-4 1095
## 2     5-9 1095
## 3   10-14  941
## 4   15-19  743
## 5   20-29 1073
## 6   30-49  754
## 7   50-69   95
## 8     70+    6
## 9    <NA>   86

You can change the name of the counts column from the default n to something else by specifying it to name =.
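For instance, a minimal sketch on a hypothetical three-row data frame (toy and the column name n_cases are invented for illustration), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical data: three cases in two age categories
toy <- data.frame(age_cat = c("0-4", "0-4", "5-9"))

toy_counts <- toy %>% 
  count(age_cat, name = "n_cases")   # counts column is called n_cases instead of n

toy_counts
```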

Counts tabulated by two or more grouping columns are still returned in “long” format, with the counts in the n column. See the page on Pivoting data to learn about “long” and “wide” data formats.

linelist %>% 
  count(age_cat, outcome)
##    age_cat outcome   n
## 1      0-4   Death 471
## 2      0-4 Recover 364
## 3      0-4    <NA> 260
## 4      5-9   Death 476
## 5      5-9 Recover 391
## 6      5-9    <NA> 228
## 7    10-14   Death 438
## 8    10-14 Recover 303
## 9    10-14    <NA> 200
## 10   15-19   Death 323
## 11   15-19 Recover 251
## 12   15-19    <NA> 169
## 13   20-29   Death 477
## 14   20-29 Recover 367
## 15   20-29    <NA> 229
## 16   30-49   Death 329
## 17   30-49 Recover 238
## 18   30-49    <NA> 187
## 19   50-69   Death  33
## 20   50-69 Recover  38
## 21   50-69    <NA>  24
## 22     70+   Death   3
## 23     70+ Recover   3
## 24    <NA>   Death  32
## 25    <NA> Recover  28
## 26    <NA>    <NA>  26

Show all levels

If you are tabulating a column of class factor, you can ensure that all levels are shown (not just the levels with values in the data) by adding .drop = FALSE into the count() command, or into the group_by() command that precedes summarise().

This technique is useful to standardise your tables/plots. For example if you are creating figures for multiple sub-groups, or repeatedly creating the figure for routine reports. In each of these circumstances, the presence of values in the data may fluctuate, but you can define levels that remain constant.
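The effect can be seen in a small sketch using a hypothetical factor column with an unused level "70+" (the data frame toy is invented for illustration). Note that with summarise(), the .drop = FALSE argument is given to the preceding group_by().

```r
library(dplyr)

# Hypothetical data: factor with a level ("70+") that has no rows
toy <- data.frame(
  age_cat = factor(
    c("0-4", "0-4", "5-9"),
    levels = c("0-4", "5-9", "70+")))

counts_default <- toy %>% count(age_cat)                  # only levels present in the data
counts_all     <- toy %>% count(age_cat, .drop = FALSE)   # all levels, 70+ has n = 0

summary_all <- toy %>% 
  group_by(age_cat, .drop = FALSE) %>%   # keep empty groups
  summarise(n_rows = n())                # 70+ appears with n_rows = 0

counts_all
```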

See the page on Factors for more information.

Proportions

Proportions can be added by piping the table to mutate() to create a new column. Define the new column as the counts column (n by default) divided by the sum() of the counts column (this will return a proportion).

Note that in this case, sum() in the mutate() command will return the sum of the whole column n for use as the proportion denominator. As explained in the Grouping data page, if sum() is used in grouped data (e.g. if the mutate() immediately followed a group_by() command), it will return sums by group. As stated just above, count() finishes its actions by ungrouping. Thus, in this scenario we get full column proportions.

To easily display percents, you can wrap the proportion in the function percent() from the package scales (note this converts the values to class character).

age_summary <- linelist %>% 
  count(age_cat) %>%                     # group and count by age_cat (produces "n" column)
  mutate(                                # create percent of column - note the denominator
    percent = scales::percent(n / sum(n))) 

# print
age_summary
##   age_cat    n percent
## 1     0-4 1095  18.60%
## 2     5-9 1095  18.60%
## 3   10-14  941  15.98%
## 4   15-19  743  12.62%
## 5   20-29 1073  18.22%
## 6   30-49  754  12.81%
## 7   50-69   95   1.61%
## 8     70+    6   0.10%
## 9    <NA>   86   1.46%

Below is a method to calculate proportions within groups. It relies on different levels of data grouping being selectively applied and removed. First, the data are grouped on outcome via group_by(). Then, count() is applied. This function further groups the data by age_cat and returns counts for each outcome-age-cat combination. Importantly - as it finishes its process, count() also ungroups the age_cat grouping, so the only remaining data grouping is the original grouping by outcome. Thus, the final step of calculating proportions (denominator sum(n)) is still grouped by outcome.

age_by_outcome <- linelist %>%                  # begin with linelist
  group_by(outcome) %>%                         # group by outcome 
  count(age_cat) %>%                            # group and count by age_cat, and then remove age_cat grouping
  mutate(percent = scales::percent(n / sum(n))) # calculate percent - note the denominator is by outcome group
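To check that the denominator really is per outcome group, the same pattern can be run on a small hypothetical data frame (toy is invented for illustration) while keeping the proportion numeric. Within each outcome, the proportions should sum to 1.

```r
library(dplyr)

# Hypothetical data: 3 deaths and 2 recoveries across two age categories
toy <- data.frame(
  outcome = c("Death", "Death", "Death", "Recover", "Recover"),
  age_cat = c("0-4",   "0-4",   "5-9",   "0-4",     "5-9"))

toy_props <- toy %>% 
  group_by(outcome) %>%        # group by outcome
  count(age_cat) %>%           # count by age_cat within outcome; outcome grouping remains
  mutate(prop = n / sum(n))    # denominator is the total within each outcome

# Sum of proportions within each outcome group
prop_check <- toy_props %>% 
  summarise(total = sum(prop))

toy_props
prop_check
```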

Plotting

To display a “long” table output like the above with ggplot() is relatively straightforward. The data are already in “long” format, which is the format that ggplot() expects. See further examples in the pages ggplot basics and ggplot tips.

linelist %>%                      # begin with linelist
  count(age_cat, outcome) %>%     # group and tabulate counts by two columns
  ggplot()+                       # pass new data frame to ggplot
    geom_col(                     # create bar plot
      mapping = aes(   
        x = outcome,              # map outcome to x-axis
        fill = age_cat,           # map age_cat to the fill
        y = n))                   # map the counts column `n` to the height

Summary statistics

One major advantage of dplyr and summarise() is the ability to return more advanced statistical summaries like median(), mean(), max(), min(), sd() (standard deviation), and percentiles. As above, these outputs can be produced for the whole data frame, or by group.

The syntax is the same - within the summarise() parentheses you provide the names of each new summary column followed by an equals sign and a statistical function to apply. Within the statistical function, give the column(s) to be operated on and any relevant arguments (e.g. na.rm = TRUE for most mathematical functions).

You can also use sum() to return the number of rows that meet logical criteria. Each row for which the expression within sum() evaluates to TRUE is counted. For example:

  • sum(age_years < 18, na.rm=T)
  • sum(gender == "male", na.rm=T)
  • sum(response %in% c("Likely", "Very Likely"))
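This works because TRUE is treated as 1 and FALSE as 0 when summed. A minimal base R sketch, using short hypothetical vectors rather than linelist columns:

```r
# Hypothetical values
age_years <- c(5, 17, 30, NA)
response  <- c("Likely", "Unlikely", "Very Likely", "Likely")

n_under_18 <- sum(age_years < 18, na.rm = TRUE)                  # NA comparison is ignored
n_likely   <- sum(response %in% c("Likely", "Very Likely"))      # %in% never returns NA

n_under_18
n_likely
```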

Below, linelist data are summarised to describe the days delay from symptom onset to hospital admission (column days_onset_hosp), by hospital.

summary_table <- linelist %>%                                        # begin with linelist, save out as new object
  group_by(hospital) %>%                                             # group all calculations by hospital
  summarise(                                                         # only the below summary columns will be returned
    cases       = n(),                                                # number of rows per group
    delay_max   = max(days_onset_hosp, na.rm = T),                    # max delay
    delay_mean  = round(mean(days_onset_hosp, na.rm=T), digits = 1),  # mean delay, rounded
    delay_sd    = round(sd(days_onset_hosp, na.rm = T), digits = 1),  # standard deviation of delays, rounded
    delay_3     = sum(days_onset_hosp >= 3, na.rm = T),               # number of rows with delay of 3 or more days
    pct_delay_3 = scales::percent(delay_3 / cases)                    # convert previously-defined delay column to percent 
  )

summary_table  # print
## # A tibble: 6 x 7
##   hospital                             cases delay_max delay_mean delay_sd delay_3 pct_delay_3
##   <chr>                                <int>     <dbl>      <dbl>    <dbl>   <int> <chr>      
## 1 Central Hospital                       454        12        1.9      1.9     108 24%        
## 2 Military Hospital                      896        15        2.1      2.4     253 28%        
## 3 Missing                               1469        22        2.1      2.3     399 27%        
## 4 Other                                  885        18        2        2.2     234 26%        
## 5 Port Hospital                         1762        16        2.1      2.2     470 27%        
## 6 St. Mark's Maternity Hospital (SMMH)   422        18        2.1      2.3     116 27%

Some tips:

  • Use sum() with a logical statement to “count” rows that meet certain criteria (e.g. using >= or ==)
  • Note the use of na.rm = TRUE within mathematical functions like sum(), otherwise NA will be returned if there are any missing values
  • Use the function percent() from the scales package to easily convert to percents
    • Set accuracy = to 0.1 or 0.01 to ensure 1 or 2 decimal places respectively
  • Use round() from base R to specify decimals
  • To calculate these statistics on the entire dataset, use summarise() without group_by()
  • You may create columns for the purposes of later calculations (e.g. denominators) that you eventually drop from your data frame with select().
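As a brief sketch of the rounding tips above, assuming the scales package is installed (the input values are arbitrary):

```r
library(scales)

pct_1_decimal  <- percent(1/3, accuracy = 0.1)    # one decimal place
pct_2_decimals <- percent(1/3, accuracy = 0.01)   # two decimal places
rounded        <- round(2.0567, digits = 1)       # base R rounding of a number

pct_1_decimal
pct_2_decimals
rounded
```

Remember that percent() returns character values, so apply it as the last step, after any numeric calculations.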

Conditional statistics

You may want to return conditional statistics - e.g. the maximum value among rows that meet certain criteria. This can be done by subsetting the column with brackets [ ]. The example below returns the maximum temperature for patients classified as having or not having fever. Be aware, however, that it may be more appropriate to add another column to the group_by() command and pivot_wider() (as demonstrated below).

linelist %>% 
  group_by(hospital) %>% 
  summarise(
    max_temp_fvr = max(temp[fever == "yes"], na.rm = T),
    max_temp_no = max(temp[fever == "no"], na.rm = T)
  )
## # A tibble: 6 x 3
##   hospital                             max_temp_fvr max_temp_no
##   <chr>                                       <dbl>       <dbl>
## 1 Central Hospital                             40.4        38  
## 2 Military Hospital                            40.5        38  
## 3 Missing                                      40.6        38  
## 4 Other                                        40.8        37.9
## 5 Port Hospital                                40.6        38  
## 6 St. Mark's Maternity Hospital (SMMH)         40.6        37.9

Glueing together

The function str_glue() from stringr is useful to combine values from several columns into one new column. In this context this is typically used after the summarise() command.

In the Characters and strings page, various options for combining columns are discussed, including unite() and paste0(). In this use case, we advocate for str_glue() because it is more flexible than unite() and has simpler syntax than paste0().

Below, the summary_table data frame (created above) is mutated such that columns delay_mean and delay_sd are combined, parentheses formating is added to the new column, and their respective old columns are removed.

Then, to make the table more presentable, a total row is added with adorn_totals() from janitor (which ignores non-numeric columns). Lastly, we use select() from dplyr to both re-order and rename to nicer column names.

Now you could pass to flextable and print the table to Word, .png, .jpeg, .html, Powerpoint, RMarkdown, etc.! (see the Tables for presentation page).

summary_table %>% 
  mutate(delay = str_glue("{delay_mean} ({delay_sd})")) %>%  # combine and format other values
  select(-c(delay_mean, delay_sd)) %>%                       # remove two old columns   
  adorn_totals(where = "row") %>%                            # add total row
  select(                                                    # order and rename cols
    "Hospital Name"   = hospital,
    "Cases"           = cases,
    "Max delay"       = delay_max,
    "Mean (sd)"       = delay,
    "Delay 3+ days"   = delay_3,
    "% delay 3+ days" = pct_delay_3
    )
##                         Hospital Name Cases Max delay Mean (sd) Delay 3+ days % delay 3+ days
##                      Central Hospital   454        12 1.9 (1.9)           108             24%
##                     Military Hospital   896        15 2.1 (2.4)           253             28%
##                               Missing  1469        22 2.1 (2.3)           399             27%
##                                 Other   885        18   2 (2.2)           234             26%
##                         Port Hospital  1762        16 2.1 (2.2)           470             27%
##  St. Mark's Maternity Hospital (SMMH)   422        18 2.1 (2.3)           116             27%
##                                 Total  5888       101         -          1580               -
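
Note that the "Other" row above prints as 2 (2.2), presumably because a rounded value of 2.0 loses its trailing zero when converted to text. One possible way to keep a fixed number of decimals is to format the numbers inside the curly brackets, as sketched below on made-up values (the data frame toy_table is hypothetical).

```r
library(dplyr)
library(stringr)

# Hypothetical values, for illustration only
toy_table <- tibble(
  hospital   = c("Central Hospital", "Other"),
  delay_mean = c(1.9, 2.0),
  delay_sd   = c(1.9, 2.2)
)

toy_glued <- toy_table %>% 
  mutate(delay = str_glue(
    "{format(delay_mean, nsmall = 1)} ({format(delay_sd, nsmall = 1)})")) %>%  # keep 1 decimal
  select(-c(delay_mean, delay_sd))                                             # remove old columns

toy_glued
```

Any R expression can be placed within the curly brackets, which is part of what makes str_glue() flexible.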

Percentiles

Percentiles and quantiles in dplyr deserve a special mention. To return quantiles, use quantile() with the defaults or specify the value(s) you would like with probs =.

# get default percentile values of age (0%, 25%, 50%, 75%, 100%)
linelist %>% 
  summarise(age_percentiles = quantile(age_years, na.rm = TRUE))
##   age_percentiles
## 1               0
## 2               6
## 3              13
## 4              23
## 5              84
# get manually-specified percentile values of age (5%, 50%, 75%, 98%)
linelist %>% 
  summarise(
    age_percentiles = quantile(
      age_years,
      probs = c(.05, 0.5, 0.75, 0.98), 
      na.rm=TRUE)
    )
##   age_percentiles
## 1               1
## 2              13
## 3              23
## 4              48

If you want to return quantiles by group, you may encounter long and less useful outputs if you simply add another column to group_by(). So, try this approach instead - create a column for each quantile level desired.

# get manually-specified percentile values of age (5%, 50%, 75%, 98%)
linelist %>% 
  group_by(hospital) %>% 
  summarise(
    p05 = quantile(age_years, probs = 0.05, na.rm=T),
    p50 = quantile(age_years, probs = 0.5, na.rm=T),
    p75 = quantile(age_years, probs = 0.75, na.rm=T),
    p98 = quantile(age_years, probs = 0.98, na.rm=T)
    )
## # A tibble: 6 x 5
##   hospital                               p05   p50   p75   p98
##   <chr>                                <dbl> <dbl> <dbl> <dbl>
## 1 Central Hospital                         1    12    21  48  
## 2 Military Hospital                        1    13    24  45  
## 3 Missing                                  1    13    23  48.2
## 4 Other                                    1    13    23  50  
## 5 Port Hospital                            1    14    24  49  
## 6 St. Mark's Maternity Hospital (SMMH)     2    12    22  50.2
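
If you do want the quantiles by group in "long" format (one row per group and quantile level), a possible sketch is shown below. It assumes dplyr version 1.1.0 or later, where summarise() warns when an expression returns more than one value per group and reframe() is intended for that case. The toy data and the names prob and age_q are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  hospital  = rep(c("A", "B"), each = 5),
  age_years = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

probs <- c(0.25, 0.5, 0.75)       # quantile levels of interest

age_quantiles <- toy %>% 
  group_by(hospital) %>% 
  reframe(                        # reframe() allows several rows per group
    prob  = probs,                # label each row with its quantile level
    age_q = unname(quantile(age_years, probs = probs, na.rm = TRUE)))

age_quantiles
```

Including the prob column keeps each value labelled, which addresses the "less useful" unlabelled output described above.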

While dplyr summarise() certainly offers more fine control, you may find that all the summary statistics you need can be produced with get_summary_stats() from the rstatix package. If operating on grouped data, it will return 0%, 25%, 50%, 75%, and 100%. If applied to ungrouped data, you can specify the percentiles with probs = c(.05, .5, .75, .98).

linelist %>% 
  group_by(hospital) %>% 
  rstatix::get_summary_stats(age, type = "quantile")
## `mutate_if()` ignored the following grouping variables:
## Column `variable`
## `mutate_if()` ignored the following grouping variables:
## Column `variable`
## `mutate_if()` ignored the following grouping variables:
## Column `variable`
## `mutate_if()` ignored the following grouping variables:
## Column `variable`
## `mutate_if()` ignored the following grouping variables:
## Column `variable`
## `mutate_if()` ignored the following grouping variables:
## Column `variable`
## # A tibble: 6 x 8
##   hospital                             variable     n  `0%` `25%` `50%` `75%` `100%`
##   <chr>                                <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Central Hospital                     age        445     0     6    12    21     58
## 2 Military Hospital                    age        884     0     6    14    24     72
## 3 Missing                              age       1441     0     6    13    23     76
## 4 Other                                age        873     0     6    13    23     69
## 5 Port Hospital                        age       1739     0     6    14    24     68
## 6 St. Mark's Maternity Hospital (SMMH) age        420     0     7    12    22     84
linelist %>% 
  rstatix::get_summary_stats(age, type = "quantile")
## `mutate_if()` ignored the following grouping variables:
## Column `variable`
## # A tibble: 1 x 7
## # Groups:   variable [1]
##   variable     n  `0%` `25%` `50%` `75%` `100%`
##   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 age       5802     0     6    13    23     84

Summarise aggregated data

If you begin with aggregated data, using n() returns the number of rows, not the sum of the aggregated counts. To get sums, use sum() on the data’s counts column.

For example, let’s say you are beginning with the data frame of counts below, called linelist_agg - it shows in “long” format the case counts by outcome and gender.

Below we create this example data frame of linelist case counts by outcome and gender (missing values removed for clarity).

linelist_agg <- linelist %>% 
  drop_na(gender, outcome) %>% 
  count(outcome, gender)

linelist_agg
##   outcome gender    n
## 1   Death      f 1227
## 2   Death      m 1228
## 3 Recover      f  953
## 4 Recover      m  950

To sum the counts (in column n) by group you can use summarise() but set the new column equal to sum(n, na.rm=T). To add a conditional element to the sum operation, you can use the subset bracket [ ] syntax on the counts column.

linelist_agg %>% 
  group_by(outcome) %>% 
  summarise(
    total_cases  = sum(n, na.rm=T),
    male_cases   = sum(n[gender == "m"], na.rm=T),
    female_cases = sum(n[gender == "f"], na.rm=T))
## # A tibble: 2 x 4
##   outcome total_cases male_cases female_cases
##   <chr>         <int>      <int>        <int>
## 1 Death          2455       1228         1227
## 2 Recover        1903        950          953
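
To make the difference between n() and sum() concrete, here is a small sketch on made-up aggregated counts (toy_agg is hypothetical). It also shows count() with its wt = argument, which sums a counts column instead of counting rows.

```r
library(dplyr)

# Hypothetical aggregated counts, for illustration only
toy_agg <- tibble(
  outcome = c("Death", "Death", "Recover", "Recover"),
  gender  = c("f", "m", "f", "m"),
  n       = c(12L, 15L, 9L, 7L)
)

compare <- toy_agg %>% 
  group_by(outcome) %>% 
  summarise(
    rows        = n(),                    # number of ROWS per group (2 each)
    total_cases = sum(n, na.rm = TRUE))   # sum of the counts column

# Alternative: count() weighted by the counts column
weighted <- toy_agg %>% 
  count(outcome, wt = n, name = "total_cases")

compare
weighted
```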

across() multiple columns

You can use summarise() across multiple columns using across(). This makes life easier when you want to calculate the same statistics for many columns. Place across() within summarise() and specify the following:

  • .cols = as either a vector of column names c() or “tidyselect” helper functions (explained below)
  • .fns = the function to perform (no parentheses) - you can provide multiple within a list()

Below, mean() is applied to several numeric columns. A vector of column names is provided explicitly to .cols = and a single function mean is specified (no parentheses) to .fns =. Any additional arguments for the function (e.g. na.rm=TRUE) are provided after .fns =, separated by a comma.

It can be difficult to get the order of parentheses and commas correct when using across(). Remember that within across() you must include the columns, the functions, and any extra arguments needed for the functions.

linelist %>% 
  group_by(outcome) %>% 
  summarise(across(.cols = c(age_years, temp, wt_kg, ht_cm),  # columns
                   .fns = mean,                               # function
                   na.rm=T))                                  # extra arguments
## # A tibble: 3 x 5
##   outcome age_years  temp wt_kg ht_cm
##   <chr>       <dbl> <dbl> <dbl> <dbl>
## 1 Death        15.9  38.6  52.6  125.
## 2 Recover      16.1  38.6  52.5  125.
## 3 <NA>         16.2  38.6  53.0  125.

Multiple functions can be run at once. Below the functions mean and sd are provided to .fns = within a list(). You have the opportunity to provide character names (e.g. “mean” and “sd”) which are appended in the new column names.

linelist %>% 
  group_by(outcome) %>% 
  summarise(across(.cols = c(age_years, temp, wt_kg, ht_cm), # columns
                   .fns = list("mean" = mean, "sd" = sd),    # multiple functions 
                   na.rm=T))                                 # extra arguments
## # A tibble: 3 x 9
##   outcome age_years_mean age_years_sd temp_mean temp_sd wt_kg_mean wt_kg_sd ht_cm_mean ht_cm_sd
##   <chr>            <dbl>        <dbl>     <dbl>   <dbl>      <dbl>    <dbl>      <dbl>    <dbl>
## 1 Death             15.9         12.3      38.6   0.962       52.6     18.4       125.     48.7
## 2 Recover           16.1         13.0      38.6   0.997       52.5     18.6       125.     50.1
## 3 <NA>              16.2         12.8      38.6   0.976       53.0     18.9       125.     50.4
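
In dplyr version 1.1.0 and later, passing extra arguments such as na.rm=T after .fns = triggers a deprecation warning. A possible alternative, sketched below on made-up data, is to write each function in the "lambda" style with ~ and .x so that the extra arguments sit inside the function call. The toy data frame is an assumption for illustration.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  outcome   = c("Death", "Death", "Death", "Recover", "Recover", "Recover"),
  age_years = c(10, 20, NA, 30, 40, 50),
  temp      = c(38, 39, 40, 37, NA, 39)
)

toy_stats <- toy %>% 
  group_by(outcome) %>% 
  summarise(across(
    .cols = c(age_years, temp),                  # columns
    .fns  = list(                                # named list of functions
      mean = ~ mean(.x, na.rm = TRUE),           # .x stands for the current column
      sd   = ~ sd(.x, na.rm = TRUE))))

toy_stats
```

The new columns are named from the column name and the list name, e.g. age_years_mean and temp_sd.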

Here are those “tidyselect” helper functions you can provide to .cols = to select columns:

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix. Example: starts_with("date")
  • ends_with() - matches to a specified suffix. Example: ends_with("_end")
  • contains() - columns containing a character string. Example: contains("time")
  • matches() - to apply a regular expression (regex). Example: matches("[pt]al")
  • num_range() - matches a numerical range of column names such as x01, x02, x03
  • any_of() - matches if column is named. Useful if the name might not exist. Example: any_of(c("date_onset", "date_death", "cardiac_arrest"))
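
To see what some of these helpers select, you can try them within select() first. The sketch below uses a one-row made-up data frame; the column names mimic the linelist but are assumptions for illustration.

```r
library(dplyr)

# Hypothetical one-row data frame, for illustration only
toy <- tibble(
  date_onset           = as.Date("2014-05-08"),
  date_hospitalisation = as.Date("2014-05-10"),
  wt_kg                = 52,
  ht_cm                = 160,
  hospital             = "Port Hospital"
)

date_cols    <- names(select(toy, starts_with("date")))       # prefix match
measure_cols <- names(select(toy, matches("_(kg|cm)$")))      # regex match
safe_cols    <- names(select(toy, any_of(c("date_onset", "date_death"))))  # date_death is absent: no error
numeric_cols <- names(select(toy, where(is.numeric)))         # columns for which is.numeric() is TRUE

date_cols
measure_cols
safe_cols
numeric_cols
```

Note that any_of() takes a character vector of names in quotes, and simply skips the names that are not present.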

For example, to return the mean of every numeric column use where() and provide the function is.numeric (without parentheses). All this remains within the across() command.

linelist %>% 
  group_by(outcome) %>% 
  summarise(across(
    .cols = where(is.numeric),  # all numeric columns in the data frame
    .fns = mean,
    na.rm=T))
## # A tibble: 3 x 12
##   outcome generation   age age_years   lon   lat wt_kg ht_cm ct_blood  temp   bmi days_onset_hosp
##   <chr>        <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>           <dbl>
## 1 Death         16.7  15.9      15.9 -13.2  8.47  52.6  125.     21.3  38.6  45.6            1.84
## 2 Recover       16.4  16.2      16.1 -13.2  8.47  52.5  125.     21.1  38.6  47.7            2.34
## 3 <NA>          16.5  16.3      16.2 -13.2  8.47  53.0  125.     21.2  38.6  48.3            2.07

Pivot wider

If you prefer your table in “wide” format you can transform it using the tidyr pivot_wider() function. You will likely need to re-name the columns with rename(). For more information see the page on Pivoting data.

The example below begins with the “long” table age_by_outcome from the proportions section. We create it again and print, for clarity:

age_by_outcome <- linelist %>%                  # begin with linelist
  group_by(outcome) %>%                         # group by outcome 
  count(age_cat) %>%                            # group and count by age_cat, and then remove age_cat grouping
  mutate(percent = scales::percent(n / sum(n))) # calculate percent - note the denominator is by outcome group

To pivot wider, we create the new columns from the values in the existing column age_cat (by setting names_from = age_cat). We also specify that the new table values will come from the existing column n, with values_from = n. The columns not mentioned in our pivoting command (outcome) will remain unchanged on the far left side.

age_by_outcome %>% 
  select(-percent) %>%   # keep only counts for simplicity
  pivot_wider(names_from = age_cat, values_from = n)  
## # A tibble: 3 x 10
## # Groups:   outcome [3]
##   outcome `0-4` `5-9` `10-14` `15-19` `20-29` `30-49` `50-69` `70+`  `NA`
##   <chr>   <int> <int>   <int>   <int>   <int>   <int>   <int> <int> <int>
## 1 Death     471   476     438     323     477     329      33     3    32
## 2 Recover   364   391     303     251     367     238      38     3    28
## 3 <NA>      260   228     200     169     229     187      24    NA    26
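
Notice the NA in the 70+ column above: that combination of outcome and age category had no rows in the long table, so there was no value to place in the cell. If a missing combination should be displayed as zero, one option is the values_fill = argument of pivot_wider(), sketched below on made-up counts (toy_counts is hypothetical).

```r
library(dplyr)
library(tidyr)

# Hypothetical long table of counts; "Recover" has no row for age 5-9
toy_counts <- tibble(
  outcome = c("Death", "Death", "Recover"),
  age_cat = c("0-4", "5-9", "0-4"),
  n       = c(3L, 2L, 4L)
)

toy_wide <- toy_counts %>% 
  pivot_wider(
    names_from  = age_cat,
    values_from = n,
    values_fill = 0L)        # fill absent combinations with 0 instead of NA

toy_wide
```

Only do this when an absent row truly means a count of zero.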

Total rows

When summarise() operates on grouped data it does not automatically produce “total” statistics. Below, two approaches to adding a total row are presented:

janitor’s adorn_totals()

If your table consists only of counts or proportions/percents that can be summed into a total, then you can add sum totals using janitor’s adorn_totals() as described in the section above. Note that this function can only sum the numeric columns - if you want to calculate other total summary statistics see the next approach with dplyr.

Below, linelist is grouped by gender and summarised into a table that describes the number of cases with known outcome, deaths, and recovered. Piping the table to adorn_totals() adds a total row at the bottom reflecting the sum of each column. The further adorn_*() functions adjust the display as noted in the code.

linelist %>% 
  group_by(gender) %>%
  summarise(
    known_outcome = sum(!is.na(outcome)),           # Number of rows in group where outcome is not missing
    n_death  = sum(outcome == "Death", na.rm=T),    # Number of rows in group where outcome is Death
    n_recover = sum(outcome == "Recover", na.rm=T), # Number of rows in group where outcome is Recovered
  ) %>% 
  adorn_totals() %>%                                # Adorn total row (sums of each numeric column)
  adorn_percentages("col") %>%                      # Get column proportions
  adorn_pct_formatting() %>%                        # Convert proportions to percents
  adorn_ns(position = "front")                      # display % and counts (with counts in front)
##  gender known_outcome       n_death     n_recover
##       f 2180  (47.8%) 1227  (47.5%)  953  (48.1%)
##       m 2178  (47.7%) 1228  (47.6%)  950  (47.9%)
##    <NA>  207   (4.5%)  127   (4.9%)   80   (4.0%)
##   Total 4565 (100.0%) 2582 (100.0%) 1983 (100.0%)

summarise() on “total” data and then bind_rows()

If your table consists of summary statistics such as median(), mean(), etc, the adorn_totals() approach shown above will not be sufficient. Instead, to get summary statistics for the entire dataset you must calculate them with a separate summarise() command and then bind the results to the original grouped summary table. To do the binding you can use bind_rows() from dplyr as described in the Joining data page. Below is an example:

You can make a summary table of outcome by hospital with group_by() and summarise() like this:

by_hospital <- linelist %>% 
  filter(!is.na(outcome) & hospital != "Missing") %>%  # Remove cases with missing outcome or hospital
  group_by(hospital, outcome) %>%                      # Group data
  summarise(                                           # Create new summary columns of indicators of interest
    N = n(),                                            # Number of rows per hospital-outcome group     
    ct_value = median(ct_blood, na.rm=T))               # median CT value per group
  
by_hospital # print table
## # A tibble: 10 x 4
## # Groups:   hospital [5]
##    hospital                             outcome     N ct_value
##    <chr>                                <chr>   <int>    <dbl>
##  1 Central Hospital                     Death     193       22
##  2 Central Hospital                     Recover   165       22
##  3 Military Hospital                    Death     399       21
##  4 Military Hospital                    Recover   309       22
##  5 Other                                Death     395       22
##  6 Other                                Recover   290       21
##  7 Port Hospital                        Death     785       22
##  8 Port Hospital                        Recover   579       21
##  9 St. Mark's Maternity Hospital (SMMH) Death     199       22
## 10 St. Mark's Maternity Hospital (SMMH) Recover   126       22

To get the totals, run the same summarise() command but only group the data by outcome (not by hospital), like this:

totals <- linelist %>% 
      filter(!is.na(outcome) & hospital != "Missing") %>%
      group_by(outcome) %>%                            # Grouped only by outcome, not by hospital    
      summarise(
        N = n(),                                       # These statistics are now by outcome only     
        ct_value = median(ct_blood, na.rm=T))

totals # print table
## # A tibble: 2 x 3
##   outcome     N ct_value
##   <chr>   <int>    <dbl>
## 1 Death    1971       22
## 2 Recover  1469       22

We can bind these two data frames together. Note that by_hospital has 4 columns whereas totals has 3 columns. By using bind_rows(), the columns are combined by name, and any extra space is filled in with NA (e.g. the column hospital values for the two new totals rows). After binding the rows, we convert these empty spaces to “Total” using replace_na() (see Cleaning data and core functions page).

table_long <- bind_rows(by_hospital, totals) %>% 
  mutate(hospital = replace_na(hospital, "Total"))
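
The way bind_rows() matches columns by name can be seen in the small sketch below, which uses made-up counts (the toy_ objects are hypothetical). Because these are simple counts, the totals here are summed from the grouped table; for statistics such as medians you would recalculate from the original data as shown above.

```r
library(dplyr)
library(tidyr)

# Hypothetical grouped summary, for illustration only
toy_by_hospital <- tibble(
  hospital = c("A", "A", "B", "B"),
  outcome  = c("Death", "Recover", "Death", "Recover"),
  N        = c(5L, 3L, 2L, 6L)
)

# Totals by outcome only: this table has no hospital column
toy_totals <- toy_by_hospital %>% 
  group_by(outcome) %>% 
  summarise(N = sum(N))

toy_long <- bind_rows(toy_by_hospital, toy_totals) %>%   # hospital is NA in the new rows
  mutate(hospital = replace_na(hospital, "Total"))       # label the new rows

toy_long
```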

The resulting table_long has the “Total” rows at the bottom. It is in a “long” format, which may be what you want. Optionally, you can pivot this table wider to make it more readable (see the section on pivoting wider above, and the Pivoting data page). You can also add more columns, and arrange it nicely. This code is below.

table_long %>% 
  
  # Pivot wider and format
  ########################
  mutate(hospital = replace_na(hospital, "Total")) %>% 
  pivot_wider(                                         # Pivot from long to wide
    values_from = c(ct_value, N),                       # new values are from ct and count columns
    names_from = outcome) %>%                           # new column names are from outcomes
  mutate(                                              # Add new columns
    N_Known = N_Death + N_Recover,                               # number with known outcome
    Pct_Death = scales::percent(N_Death / N_Known, 0.1),         # percent cases who died (to 1 decimal)
    Pct_Recover = scales::percent(N_Recover / N_Known, 0.1)) %>% # percent who recovered (to 1 decimal)
  select(                                              # Re-order columns
    hospital, N_Known,                                   # Intro columns
    N_Recover, Pct_Recover, ct_value_Recover,            # Recovered columns
    N_Death, Pct_Death, ct_value_Death)  %>%             # Death columns
  arrange(N_Known)                                  # Arrange rows from lowest to highest (Total row at bottom)
## # A tibble: 6 x 8
## # Groups:   hospital [6]
##   hospital                             N_Known N_Recover Pct_Recover ct_value_Recover N_Death Pct_Death ct_value_Death
##   <chr>                                  <int>     <int> <chr>                  <dbl>   <int> <chr>              <dbl>
## 1 St. Mark's Maternity Hospital (SMMH)     325       126 38.8%                     22     199 61.2%                 22
## 2 Central Hospital                         358       165 46.1%                     22     193 53.9%                 22
## 3 Other                                    685       290 42.3%                     21     395 57.7%                 22
## 4 Military Hospital                        708       309 43.6%                     22     399 56.4%                 21
## 5 Port Hospital                           1364       579 42.4%                     21     785 57.6%                 22
## 6 Total                                   3440      1469 42.7%                     22    1971 57.3%                 22

And then you can print this nicely as an image, for example with flextable. You can read more in depth about this example and how to achieve this “pretty” table in the Tables for presentation page.

17.5 gtsummary package

If you want to print your summary statistics in a pretty, publication-ready graphic, you can use the gtsummary package and its function tbl_summary(). The code can seem complex at first, but the outputs look very nice and print to your RStudio Viewer panel as an HTML image. Read a vignette here.

You can also add the results of statistical tests to gtsummary tables. This process is described in the gtsummary section of the Simple statistical tests page.

To introduce tbl_summary() we will show the most basic behavior first, which actually produces a large and beautiful table. Then, we will examine in detail how to make adjustments and more tailored tables.

Summary table

The default behavior of tbl_summary() is quite incredible - it takes the columns you provide and creates a summary table in one command. The function prints statistics appropriate to the column class: median and inter-quartile range (IQR) for numeric columns, and counts (%) for categorical columns. Missing values are converted to “Unknown”. Footnotes are added to the bottom to explain the statistics, while the total N is shown at the top.

linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>%  # keep only the columns of interest
  tbl_summary()                                                  # default
Characteristic                          N = 5,888¹
age_years                               13 (6, 23)
  Unknown                               86
gender
  f                                     2,807 (50%)
  m                                     2,803 (50%)
  Unknown                               278
outcome
  Death                                 2,582 (57%)
  Recover                               1,983 (43%)
  Unknown                               1,323
fever                                   4,549 (81%)
  Unknown                               249
temp                                    38.80 (38.20, 39.20)
  Unknown                               149
hospital
  Central Hospital                      454 (7.7%)
  Military Hospital                     896 (15%)
  Missing                               1,469 (25%)
  Other                                 885 (15%)
  Port Hospital                         1,762 (30%)
  St. Mark's Maternity Hospital (SMMH)  422 (7.2%)

¹ Median (IQR); n (%)

Adjustments

Now we will explain how the function works and how to make adjustments. The key arguments are detailed below:

by =
You can stratify your table by a column (e.g. by outcome), creating a 2-way table.

statistic =
Use an equation to specify which statistics to show and how to display them. There are two sides to the equation, separated by a tilde ~. On the right side, in quotes, is the statistical display desired, and on the left are the columns to which that display will apply.

  • The right side of the equation uses the syntax of str_glue() from stringr (see Characters and Strings), with the desired display string in quotes and the statistics themselves within curly brackets. You can include statistics like “n” (for counts), “N” (for denominator), “mean”, “median”, “sd”, “max”, “min”, percentiles as “p##” like “p25”, or percent of total as “p”. See ?tbl_summary for details.
  • For the left side of the equation, you can specify columns by name (e.g. age or c(age, gender)) or using helpers such as all_continuous(), all_categorical(), contains(), starts_with(), etc.

A simple example of a statistic = equation might look like below, to only print the mean of column age_years:

linelist %>% 
  select(age_years) %>%         # keep only columns of interest 
  tbl_summary(                  # create summary table
    statistic = age_years ~ "{mean}") # print mean of age
Characteristic    N = 5,888¹
age_years         16
  Unknown         86

¹ Mean

A slightly more complex equation might look like "({min}, {max})", incorporating the max and min values within parentheses and separated by a comma:

linelist %>% 
  select(age_years) %>%                       # keep only columns of interest 
  tbl_summary(                                # create summary table
    statistic = age_years ~ "({min}, {max})") # print min and max of age
Characteristic    N = 5,888¹
age_years         (0, 84)
  Unknown         86

¹ (Range)

You can also differentiate syntax for separate columns or types of columns. In the more complex example below, the value provided to statistic = is a list indicating that for all continuous columns the table should print mean with standard deviation in parentheses, while for all categorical columns it should print the n, denominator, and percent.

digits =
Adjust the digits and rounding. Optionally, this can be specified to be for continuous columns only (as below).

label =
Adjust how the column name should be displayed. Provide the column name and its desired label separated by a tilde. The default is the column name.

missing_text =
Adjust how missing values are displayed. The default is “Unknown”.

type =
This is used to adjust how many levels of the statistics are shown. The syntax is similar to statistic = in that you provide an equation with columns on the left and a value on the right. Two common scenarios include:

  • type = all_categorical() ~ "categorical" Forces dichotomous columns (e.g. fever yes/no) to show all levels instead of only the “yes” row
  • type = all_continuous() ~ "continuous2" Allows multi-line statistics per variable, as shown in a later section

In the example below, each of these arguments is used to modify the original summary table:

linelist %>% 
  select(age_years, gender, outcome, fever, temp, hospital) %>% # keep only columns of interest
  tbl_summary(     
    by = outcome,                                               # stratify entire table by outcome
    statistic = list(all_continuous() ~ "{mean} ({sd})",        # stats and format for continuous columns
                     all_categorical() ~ "{n} / {N} ({p}%)"),   # stats and format for categorical columns
    digits = all_continuous() ~ 1,                              # rounding for continuous columns
    type   = all_categorical() ~ "categorical",                 # force all categorical levels to display
    label  = list(                                              # display labels for column names
      outcome   ~ "Outcome",                           
      age_years ~ "Age (years)",
      gender    ~ "Gender",
      temp      ~ "Temperature",
      hospital  ~ "Hospital"),
    missing_text = "Missing"                                    # how missing values should display
  )
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic                          Death, N = 2,582¹      Recover, N = 1,983¹
Age (years)                             15.9 (12.3)            16.1 (13.0)
  Missing                               32                     28
Gender
  f                                     1,227 / 2,455 (50%)    953 / 1,903 (50%)
  m                                     1,228 / 2,455 (50%)    950 / 1,903 (50%)
  Missing                               127                    80
fever
  no                                    458 / 2,460 (19%)      361 / 1,904 (19%)
  yes                                   2,002 / 2,460 (81%)    1,543 / 1,904 (81%)
  Missing                               122                    79
Temperature                             38.6 (1.0)             38.6 (1.0)
  Missing                               60                     55
Hospital
  Central Hospital                      193 / 2,582 (7.5%)     165 / 1,983 (8.3%)
  Military Hospital                     399 / 2,582 (15%)      309 / 1,983 (16%)
  Missing                               611 / 2,582 (24%)      514 / 1,983 (26%)
  Other                                 395 / 2,582 (15%)      290 / 1,983 (15%)
  Port Hospital                         785 / 2,582 (30%)      579 / 1,983 (29%)
  St. Mark's Maternity Hospital (SMMH)  199 / 2,582 (7.7%)     126 / 1,983 (6.4%)

¹ Mean (SD); n / N (%)

Multi-line stats for continuous variables

If you want to print multiple lines of statistics for continuous variables, set type = to “continuous2” for those columns. You can then provide several display strings to statistic =, combining any of the previously shown elements, and each string is printed on its own line. The number of missing values is shown as “Unknown”.

linelist %>% 
  select(age_years, temp) %>%                      # keep only columns of interest
  tbl_summary(                                     # create summary table
    type = all_continuous() ~ "continuous2",       # indicate that you want to print multiple statistics 
    statistic = all_continuous() ~ c(
      "{mean} ({sd})",                             # line 1: mean and SD
      "{median} ({p25}, {p75})",                   # line 2: median and IQR
      "{min}, {max}")                              # line 3: min and max
    )
Characteristic      N = 5,888
age_years
  Mean (SD)         16 (13)
  Median (IQR)      13 (6, 23)
  Range             0, 84
  Unknown           86
temp
  Mean (SD)         38.56 (0.98)
  Median (IQR)      38.80 (38.20, 39.20)
  Range             35.20, 40.80
  Unknown           149

There are many other ways to modify these tables, including adding p-values, adjusting color and headings, etc. Many of these are described in the documentation (enter ?tbl_summary in Console), and some are given in the section on statistical tests.
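
As one small illustration of such modifications, the sketch below adds an "Overall" column to a stratified table with add_overall() and bolds the variable labels with bold_labels(). It uses trial, an example dataset bundled with gtsummary, rather than the linelist, so the column names (trt, age, grade) are specific to that dataset.

```r
library(dplyr)
library(gtsummary)

# "trial" is an example dataset that ships with the gtsummary package
tbl_overall <- trial %>% 
  select(trt, age, grade) %>%      # keep only columns of interest
  tbl_summary(by = trt) %>%        # stratify by treatment group
  add_overall() %>%                # add a column summarising all rows together
  bold_labels()                    # display variable labels in bold

tbl_overall                        # prints to the RStudio Viewer pane
```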

17.6 base R

You can use the function table() to tabulate and cross-tabulate columns. Unlike the options above, you must specify the dataframe each time you reference a column name, as shown below.

CAUTION: NA (missing) values will not be tabulated unless you include the argument useNA = "always" (which could also be set to “no” or “ifany”).

TIP: You can use the %$% from magrittr to remove the need for repeating data frame calls within base functions. For example the below could be written linelist %$% table(outcome, useNA = "always")

table(linelist$outcome, useNA = "always")
## 
##   Death Recover    <NA> 
##    2582    1983    1323

Multiple columns can be cross-tabulated by listing them one after the other, separated by commas. Optionally, you can assign each column a “name” like Outcome = linelist$outcome.

age_by_outcome <- table(linelist$age_cat, linelist$outcome, useNA = "always") # save table as object
age_by_outcome   # print table
##        
##         Death Recover <NA>
##   0-4     471     364  260
##   5-9     476     391  228
##   10-14   438     303  200
##   15-19   323     251  169
##   20-29   477     367  229
##   30-49   329     238  187
##   50-69    33      38   24
##   70+       3       3    0
##   <NA>     32      28   26
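
Assigning names to the columns, as mentioned above, labels the dimensions of the table. A minimal base R sketch on a made-up data frame (toy is hypothetical) is below.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  age_cat = c("0-4", "0-4", "5-9", "5-9", "5-9", NA),
  outcome = c("Death", "Recover", "Death", "Death", NA, "Recover")
)

# Name each dimension so the printed table is labelled
named_tab <- table(Age = toy$age_cat, Outcome = toy$outcome, useNA = "always")

named_tab                    # print table
names(dimnames(named_tab))   # the dimension names
named_tab["5-9", "Death"]    # extract a single cell by its row and column labels
```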

Proportions

To return proportions, pass the above table to the function prop.table(). Use the margin = argument to specify whether you want the proportions to be of rows (1) or of columns (2); omit it to get proportions of the whole table. For clarity, we pipe the table to the round() function from base R, specifying 2 digits.

# get proportions of table defined above, by rows, rounded
prop.table(age_by_outcome, 1) %>% round(2)
##        
##         Death Recover <NA>
##   0-4    0.43    0.33 0.24
##   5-9    0.43    0.36 0.21
##   10-14  0.47    0.32 0.21
##   15-19  0.43    0.34 0.23
##   20-29  0.44    0.34 0.21
##   30-49  0.44    0.32 0.25
##   50-69  0.35    0.40 0.25
##   70+    0.50    0.50 0.00
##   <NA>   0.37    0.33 0.30
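
The three choices of denominator can be compared side by side on a tiny made-up table, as sketched below in base R (the counts object is hypothetical).

```r
# Hypothetical 2 x 2 table of counts, for illustration only
counts <- table(
  Age     = c("0-4", "0-4", "5-9", "5-9"),
  Outcome = c("Death", "Recover", "Death", "Death")
)

by_row  <- prop.table(counts, margin = 1)   # each row sums to 1
by_col  <- prop.table(counts, margin = 2)   # each column sums to 1
overall <- prop.table(counts)               # whole table sums to 1

round(by_row, 2)
round(by_col, 2)
round(overall, 2)
```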

Totals

To add row and column totals, pass the table to addmargins(). This works for both counts and proportions.

addmargins(age_by_outcome)
##        
##         Death Recover <NA>  Sum
##   0-4     471     364  260 1095
##   5-9     476     391  228 1095
##   10-14   438     303  200  941
##   15-19   323     251  169  743
##   20-29   477     367  229 1073
##   30-49   329     238  187  754
##   50-69    33      38   24   95
##   70+       3       3    0    6
##   <NA>     32      28   26   86
##   Sum    2582    1983 1323 5888

Convert to data frame

Converting a table() object directly to a data frame is not straight-forward. One approach is demonstrated below:

  1. Create the table, without using useNA = "always". Instead convert NA values to “(Missing)” with fct_explicit_na() from forcats.
  2. Add totals (optional) by piping to addmargins()
  3. Pipe to the base R function as.data.frame.matrix()
  4. Pipe the table to the tibble function rownames_to_column(), specifying the name for the first column
  5. Print, View, or export as desired. In this example we use flextable() from package flextable as described in the Tables for presentation page. This will print to the RStudio viewer pane as a pretty HTML image.
table(fct_explicit_na(linelist$age_cat), fct_explicit_na(linelist$outcome)) %>% 
  addmargins() %>% 
  as.data.frame.matrix() %>% 
  tibble::rownames_to_column(var = "Age Category") %>% 
  flextable::flextable()
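
To show why the conversion needs some care, the sketch below compares two base R conversions on made-up vectors (age_cat and outcome here are toy objects, not linelist columns). It is only an illustration of the shapes involved, not a replacement for the steps above.

```r
# toy vectors (made-up values, for illustration only)
age_cat <- c("0-4", "0-4", "5-9", "5-9", "5-9", "0-4")
outcome <- c("Death", "Recover", "Death", "Death", "Recover", "Death")

tab <- table(age_cat, outcome)

# as.data.frame() gives "long" format: one row per combination, counts in Freq
long_df <- as.data.frame(tab)

# as.data.frame.matrix() keeps the cross-tab shape, with categories as row names
wide_df <- as.data.frame.matrix(tab)

long_df
wide_df
```

The wide version stores the row categories as row names, which is why a later step moves them into a regular column.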

17.7 Resources

Much of the information in this page is adapted from these resources and vignettes online:

gtsummary

dplyr

18 Simple statistical tests

This page demonstrates how to conduct simple statistical tests using base R, rstatix, and gtsummary.

  • T-test
  • Shapiro-Wilk test
  • Wilcoxon rank sum test
  • Kruskal-Wallis test
  • Chi-squared test
  • Correlations between numeric variables

…many other tests can be performed, but we showcase just these common ones and link to further documentation.

Each of the above packages brings certain advantages and disadvantages:

  • Use base R functions to print statistical outputs to the R Console
  • Use rstatix functions to return results in a data frame, or if you want tests to run by group
  • Use gtsummary if you want to quickly print publication-ready tables

18.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,          # File import
  here,         # File locator
  skimr,        # get overview of data
  tidyverse,    # data management + ggplot2 graphics, 
  gtsummary,    # summary statistics and tests
  rstatix,      # statistics
  corrr,        # correlation analysis for numeric variables
  janitor,      # adding totals and percents to tables
  flextable     # converting tables to HTML
  )

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

18.2 base R

You can use base R functions to conduct statistical tests. The commands are relatively simple and results will print to the R Console for simple viewing. However, the outputs are usually lists and so are harder to manipulate if you want to use the results in subsequent operations.

T-tests

A t-test, also called “Student’s t-Test”, is typically used to determine if there is a significant difference between the means of some numeric variable between two groups. Here we’ll show the syntax to do this test depending on whether the columns are in the same data frame.

Syntax 1: This is the syntax when your numeric and categorical columns are in the same data frame. Provide the numeric column on the left side of the equation and the categorical column on the right side. Specify the dataset to data =. Optionally, set paired = TRUE, and conf.level = (0.95 default), and alternative = (either “two.sided”, “less”, or “greater”). Enter ?t.test for more details.

## compare mean age by gender group with a t-test
t.test(age_years ~ gender, data = linelist)
## 
##  Welch Two Sample t-test
## 
## data:  age_years by gender
## t = -21.344, df = 4902.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group f and group m is not equal to 0
## 95 percent confidence interval:
##  -7.571920 -6.297975
## sample estimates:
## mean in group f mean in group m 
##        12.60207        19.53701

Syntax 2: You can compare two separate numeric vectors using this alternative syntax. This is useful, for example, when the two columns are in different data sets.

t.test(df1$age_years, df2$age_years)

You can also use a t-test to determine whether a sample mean is significantly different from some specific value. Here we conduct a one-sample t-test with the known/hypothesized population mean as mu =:

t.test(linelist$age_years, mu = 45)
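
Because base R tests return list-like objects, it can help to see how individual values are pulled out. The sketch below runs a one-sample t-test on simulated ages (the vector ages is made up, not from the linelist) and extracts a few elements by name.

```r
set.seed(42)                             # make the simulated data reproducible
ages <- rnorm(200, mean = 40, sd = 10)   # simulated ages, not from the linelist

tt <- t.test(ages, mu = 45)              # one-sample t-test against mu = 45

class(tt)          # "htest", which is a list underneath
names(tt)          # elements available for extraction

tt$statistic       # t value
tt$p.value         # p-value
tt$conf.int        # confidence interval of the mean
tt$estimate        # sample mean
```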

Shapiro-Wilk test

The Shapiro-Wilk test can be used to determine whether a sample came from a normally-distributed population (an assumption of many other tests and analyses, such as the t-test). However, this can only be used on a sample between 3 and 5000 observations. For larger samples a quantile-quantile plot may be helpful.

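# the full column has more than 5000 non-missing values, so test a subset (for example only)
shapiro.test(head(linelist$age_years, 500))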
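
For samples larger than the Shapiro-Wilk limit, a quantile-quantile plot is one option. The sketch below uses simulated skewed data (the vector x is made up) to show both the size limit and a base R Q-Q plot.

```r
set.seed(1)
x <- rexp(6000)     # simulated right-skewed data with more than 5000 values

# shapiro.test() stops with an error above 5000 observations;
# tryCatch() captures the message instead of halting the script
sw <- tryCatch(shapiro.test(x), error = function(e) conditionMessage(e))
sw

# quantile-quantile plot: points far from the line suggest non-normality
qqnorm(x)
qqline(x, col = "red")
```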

Wilcoxon rank sum test

The Wilcoxon rank sum test, also called the Mann–Whitney U test, is often used to help determine if two numeric samples are from the same distribution when their populations are not normally distributed or have unequal variance.

## compare age distribution by outcome group with a wilcox test
wilcox.test(age_years ~ outcome, data = linelist)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age_years by outcome
## W = 2501868, p-value = 0.8308
## alternative hypothesis: true location shift is not equal to 0

Kruskal-Wallis test

The Kruskal-Wallis test is an extension of the Wilcoxon rank sum test that can be used to test for differences in the distribution of more than two samples. When only two samples are used it gives identical results to the Wilcoxon rank sum test.

## compare age distribution by outcome group with a kruskal-wallis test
kruskal.test(age_years ~ outcome, linelist)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  age_years by outcome
## Kruskal-Wallis chi-squared = 0.045675, df = 1, p-value = 0.8308
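
The two-group equivalence can be checked on simulated data. In this sketch (the data frame sim is made up), the p-values match when the Wilcoxon test uses the normal approximation without continuity correction; with the default settings they are typically very close rather than exactly equal.

```r
set.seed(123)
# simulated data: a numeric value in two groups (not from the linelist)
sim <- data.frame(
  value = c(rnorm(60, mean = 10), rnorm(80, mean = 10.5)),
  group = rep(c("a", "b"), times = c(60, 80))
)

kw <- kruskal.test(value ~ group, data = sim)

# normal approximation without continuity correction, to match Kruskal-Wallis
wx <- wilcox.test(value ~ group, data = sim, exact = FALSE, correct = FALSE)

c(kruskal = kw$p.value, wilcoxon = wx$p.value)
```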

Chi-squared test

Pearson’s Chi-squared test is used in testing for significant differences between categorical groups.

## compare the proportions in each group with a chi-squared test
chisq.test(linelist$gender, linelist$outcome)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  linelist$gender and linelist$outcome
## X-squared = 0.0011841, df = 1, p-value = 0.9725
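
It is often worth inspecting the expected counts behind a chi-squared test. The sketch below uses a made-up 2 x 2 matrix of counts (tab is invented for illustration) to show the expected counts, the effect of the continuity correction, and Fisher's exact test as a common alternative for small counts.

```r
# made-up 2 x 2 counts, for illustration only
tab <- matrix(c(12, 5, 7, 9), nrow = 2,
              dimnames = list(gender = c("f", "m"),
                              outcome = c("Death", "Recover")))

chi    <- chisq.test(tab)                   # Yates' correction is the default for 2 x 2 tables
chi_nc <- chisq.test(tab, correct = FALSE)  # without continuity correction

chi$expected        # expected counts under independence
chi$p.value
chi_nc$p.value

# Fisher's exact test is a common alternative when expected counts are small
fisher.test(tab)$p.value
```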

18.3 rstatix package

The rstatix package offers the ability to run statistical tests and retrieve results in a “pipe-friendly” framework. The results are automatically in a data frame so that you can perform subsequent operations on the results. It is also easy to group the data being passed into the functions, so that the statistics are run for each group.

Summary statistics

The function get_summary_stats() is a quick way to return summary statistics. Simply pipe your dataset to this function and provide the columns to analyse. If no columns are specified, the statistics are calculated for all columns.

By default, a full range of summary statistics are returned: n, max, min, median, 25%ile, 75%ile, IQR, median absolute deviation (mad), mean, standard deviation, standard error, and a confidence interval of the mean.

linelist %>%
  rstatix::get_summary_stats(age, temp)
## # A tibble: 2 x 13
##   variable     n   min   max median    q1    q3   iqr    mad  mean     sd    se    ci
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 age       5802   0    84     13     6    23      17 11.9    16.1 12.6   0.166 0.325
## 2 temp      5739  35.2  40.8   38.8  38.2  39.2     1  0.741  38.6  0.977 0.013 0.025

You can specify a subset of summary statistics to return by providing one of the following values to type =: “full”, “common”, “robust”, “five_number”, “mean_sd”, “mean_se”, “mean_ci”, “median_iqr”, “median_mad”, “quantile”, “mean”, “median”, “min”, “max”.

It can be used with grouped data as well, such that a row is returned for each combination of group and variable:

linelist %>%
  group_by(hospital) %>%
  rstatix::get_summary_stats(age, temp, type = "common")
## # A tibble: 12 x 11
##    hospital                             variable     n   min   max median   iqr  mean     sd    se    ci
##    <chr>                                <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
##  1 Central Hospital                     age        445   0    58     12    15    15.7 12.5   0.591 1.16 
##  2 Central Hospital                     temp       450  35.2  40.4   38.8   1    38.5  0.964 0.045 0.089
##  3 Military Hospital                    age        884   0    72     14    18    16.1 12.4   0.417 0.818
##  4 Military Hospital                    temp       873  35.3  40.5   38.8   1    38.6  0.952 0.032 0.063
##  5 Missing                              age       1441   0    76     13    17    16.0 12.9   0.339 0.665
##  6 Missing                              temp      1431  35.8  40.6   38.9   1    38.6  0.97  0.026 0.05 
##  7 Other                                age        873   0    69     13    17    16.0 12.5   0.422 0.828
##  8 Other                                temp       862  35.7  40.8   38.8   1.1  38.5  1.01  0.034 0.067
##  9 Port Hospital                        age       1739   0    68     14    18    16.3 12.7   0.305 0.598
## 10 Port Hospital                        temp      1713  35.5  40.6   38.8   1.1  38.6  0.981 0.024 0.046
## 11 St. Mark's Maternity Hospital (SMMH) age        420   0    84     12    15    15.7 12.4   0.606 1.19 
## 12 St. Mark's Maternity Hospital (SMMH) temp       410  35.9  40.6   38.8   1.1  38.5  0.983 0.049 0.095

You can also use rstatix to conduct statistical tests:

T-test

Use a formula syntax to specify the numeric and categorical columns:

linelist %>% 
  t_test(age_years ~ gender)
## # A tibble: 1 x 10
##   .y.       group1 group2    n1    n2 statistic    df        p    p.adj p.adj.signif
## * <chr>     <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>    <dbl> <chr>       
## 1 age_years f      m       2807  2803     -21.3 4902. 9.89e-97 9.89e-97 ****

Or use ~ 1 and specify mu = for a one-sample T-test. This can also be done by group.

linelist %>% 
  t_test(age_years ~ 1, mu = 30)
## # A tibble: 1 x 7
##   .y.       group1 group2         n statistic    df     p
## * <chr>     <chr>  <chr>      <int>     <dbl> <dbl> <dbl>
## 1 age_years 1      null model  5888     -84.2  5801     0

If applicable, the statistical tests can be done by group, as shown below:

linelist %>% 
  group_by(gender) %>% 
  t_test(age_years ~ 1, mu = 18)
## # A tibble: 3 x 8
##   gender .y.       group1 group2         n statistic    df         p
## * <chr>  <chr>     <chr>  <chr>      <int>     <dbl> <dbl>     <dbl>
## 1 f      age_years 1      null model  2807    -29.8   2806 7.52e-170
## 2 m      age_years 1      null model  2803      5.70  2802 1.34e-  8
## 3 <NA>   age_years 1      null model   278     -3.80   191 1.96e-  4

Shapiro-Wilk test

As stated above, sample size must be between 3 and 5000.

linelist %>% 
  head(500) %>%            # first 500 rows of case linelist, for example only
  shapiro_test(age_years)
## # A tibble: 1 x 3
##   variable  statistic        p
##   <chr>         <dbl>    <dbl>
## 1 age_years     0.917 6.67e-16

Wilcoxon rank sum test

Also known as the Mann-Whitney U test.

linelist %>% 
  wilcox_test(age_years ~ gender)
## # A tibble: 1 x 9
##   .y.       group1 group2    n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>     <chr>  <chr>  <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 age_years f      m       2807  2803   2829274 3.47e-74 3.47e-74 ****

Kruskal-Wallis test

An extension of the Wilcoxon rank sum test to more than two groups.

linelist %>% 
  kruskal_test(age_years ~ outcome)
## # A tibble: 1 x 6
##   .y.           n statistic    df     p method        
## * <chr>     <int>     <dbl> <int> <dbl> <chr>         
## 1 age_years  5888    0.0457     1 0.831 Kruskal-Wallis

Chi-squared test

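The chi-square test function accepts a table, so first we create a cross-tabulation. There are many ways to create a cross-tabulation (see Descriptive tables) but here we use tabyl() from janitor and remove the left-most column of value labels before passing to chisq_test(). Note that tabyl() keeps NA as its own row and column by default, so the test below is run on a 3 x 3 table (hence df = 4); set show_na = FALSE in tabyl() to exclude missing values.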

linelist %>% 
  tabyl(gender, outcome) %>% 
  select(-1) %>% 
  chisq_test()
## # A tibble: 1 x 6
##       n statistic     p    df method          p.signif
## * <dbl>     <dbl> <dbl> <int> <chr>           <chr>   
## 1  5888      3.53 0.473     4 Chi-square test ns

Many more functions and statistical tests can be run with rstatix functions. See the documentation for rstatix online here or by entering ?rstatix.

18.4 gtsummary package

Use gtsummary if you are looking to add the results of a statistical test to a pretty table that was created with this package (as described in the gtsummary section of the Descriptive tables page).

Performing statistical tests of comparison with tbl_summary is done by adding the add_p function to a table and specifying which test to use. It is possible to get p-values corrected for multiple testing by using the add_q function. Run ?tbl_summary for details.
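
To make the idea of multiple-testing correction concrete, the sketch below applies base R's p.adjust() to a made-up vector of p-values. The Benjamini-Hochberg false discovery rate method is used here only as an example; check ?add_q for the method gtsummary applies in your installed version.

```r
p_values <- c(0.001, 0.02, 0.03, 0.04)     # made-up p-values from four tests

q_fdr  <- p.adjust(p_values, method = "BH")          # false discovery rate
q_bonf <- p.adjust(p_values, method = "bonferroni")  # more conservative

data.frame(p = p_values, q_fdr, q_bonf)
```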

Chi-squared test

Compare the proportions of a categorical variable in two groups. The default statistical test for add_p() when applied to a categorical variable is to perform a chi-squared test of independence with continuity correction, but if any expected cell count is below 5 then a Fisher’s exact test is used.

linelist %>% 
  select(gender, outcome) %>%    # keep variables of interest
  tbl_summary(by = outcome) %>%  # produce summary table and specify grouping variable
  add_p()                        # specify what test to perform
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic    Death, N = 2,582¹    Recover, N = 1,983¹    p-value²
gender                                                        >0.9
  f               1,227 (50%)          953 (50%)
  m               1,228 (50%)          950 (50%)
  Unknown         127                  80

¹ n (%)
² Pearson's Chi-squared test

T-tests

Compare the difference in means for a continuous variable in two groups. For example, compare the mean age by patient outcome.

linelist %>% 
  select(age_years, outcome) %>%             # keep variables of interest
  tbl_summary(                               # produce summary table
    statistic = age_years ~ "{mean} ({sd})", # specify what statistics to show
    by = outcome) %>%                        # specify the grouping variable
  add_p(age_years ~ "t.test")                # specify what tests to perform
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic    Death, N = 2,582¹    Recover, N = 1,983¹    p-value²
age_years         16 (12)              16 (13)                0.6
  Unknown         32                   28

¹ Mean (SD)
² Welch Two Sample t-test

Wilcoxon rank sum test

Compare the distribution of a continuous variable in two groups. The default is to use the Wilcoxon rank sum test and the median (IQR) when comparing two groups. However, when comparing more than two groups, the Kruskal-Wallis test is more appropriate.

linelist %>% 
  select(age_years, outcome) %>%                       # keep variables of interest
  tbl_summary(                                         # produce summary table
    statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (this is default so could remove)
    by = outcome) %>%                                  # specify the grouping variable
  add_p(age_years ~ "wilcox.test")                     # specify what test to perform (default so could leave brackets empty)
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic    Death, N = 2,582¹    Recover, N = 1,983¹    p-value²
age_years         13 (6, 23)           13 (6, 23)             0.8
  Unknown         32                   28

¹ Median (IQR)
² Wilcoxon rank sum test

Kruskal-Wallis test

Compare the distribution of a continuous variable in two or more groups, regardless of whether the data is normally distributed.

linelist %>% 
  select(age_years, outcome) %>%                       # keep variables of interest
  tbl_summary(                                         # produce summary table
    statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (default, so could remove)
    by = outcome) %>%                                  # specify the grouping variable
  add_p(age_years ~ "kruskal.test")                    # specify what test to perform
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
Characteristic    Death, N = 2,582¹    Recover, N = 1,983¹    p-value²
age_years         13 (6, 23)           13 (6, 23)             0.8
  Unknown         32                   28

¹ Median (IQR)
² Kruskal-Wallis rank sum test

18.5 Correlations

Correlation between numeric variables can be investigated using the corrr package (part of tidymodels). It allows you to compute correlations using Pearson, Kendall tau or Spearman rho. The package creates a table and also has a function to automatically plot the values.

correlation_tab <- linelist %>% 
  select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>%   # keep numeric variables of interest
  correlate()      # create correlation table (using default pearson)

correlation_tab    # print
## # A tibble: 6 x 7
##   term            generation       age ct_blood days_onset_hosp    wt_kg    ht_cm
##   <chr>                <dbl>     <dbl>    <dbl>           <dbl>    <dbl>    <dbl>
## 1 generation        NA       -0.0222    0.179         -0.288    -0.0302  -0.00942
## 2 age               -0.0222  NA         0.00849       -0.000635  0.833    0.877  
## 3 ct_blood           0.179    0.00849  NA             -0.600    -0.00636  0.0181 
## 4 days_onset_hosp   -0.288   -0.000635 -0.600         NA         0.0153  -0.00953
## 5 wt_kg             -0.0302   0.833    -0.00636        0.0153   NA        0.884  
## 6 ht_cm             -0.00942  0.877     0.0181        -0.00953   0.884   NA

## remove duplicate entries (the table above is mirrored) 
correlation_tab <- correlation_tab %>% 
  shave()

## view correlation table 
correlation_tab
## # A tibble: 6 x 7
##   term            generation       age ct_blood days_onset_hosp  wt_kg ht_cm
##   <chr>                <dbl>     <dbl>    <dbl>           <dbl>  <dbl> <dbl>
## 1 generation        NA       NA        NA              NA       NA        NA
## 2 age               -0.0222  NA        NA              NA       NA        NA
## 3 ct_blood           0.179    0.00849  NA              NA       NA        NA
## 4 days_onset_hosp   -0.288   -0.000635 -0.600          NA       NA        NA
## 5 wt_kg             -0.0302   0.833    -0.00636         0.0153  NA        NA
## 6 ht_cm             -0.00942  0.877     0.0181         -0.00953  0.884    NA

## plot correlations 
rplot(correlation_tab)
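
For comparison, a similar table can be sketched with base R's cor(). The data below are simulated (the column names mirror the linelist but the values are made up), and the choice of use = "pairwise.complete.obs" for handling missing values is an assumption made for this example.

```r
set.seed(7)
# simulated numeric columns with a few missing values (not from the linelist)
n <- 100
age   <- round(runif(n, 0, 60))
ht_cm <- 70 + 1.5 * age + rnorm(n, sd = 10)
wt_kg <- 10 + 0.9 * age + rnorm(n, sd = 8)
ht_cm[c(3, 17)] <- NA

dat <- data.frame(age, ht_cm, wt_kg)

# pairwise.complete.obs uses all rows available for each pair of columns
cor_pearson  <- cor(dat, use = "pairwise.complete.obs", method = "pearson")
cor_spearman <- cor(dat, use = "pairwise.complete.obs", method = "spearman")

# blank out the diagonal and upper triangle, similar in spirit to shave()
cor_lower <- cor_pearson
cor_lower[upper.tri(cor_lower, diag = TRUE)] <- NA

round(cor_lower, 2)
```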

18.6 Resources

Much of the information in this page is adapted from these resources and vignettes online:

gtsummary

dplyr

corrr

sthda correlation

19 Univariate and multivariable regression

This page demonstrates the use of base R regression functions such as glm() and the gtsummary package to look at associations between variables (e.g. odds ratios, risk ratios and hazard ratios). It also uses functions like tidy() from the broom package to clean-up regression outputs.

  1. Univariate: two-by-two tables
  2. Stratified: Mantel-Haenszel estimates
  3. Multivariable: variable selection, model selection, final table
  4. Forest plots

For Cox proportional hazard regression, see the Survival analysis page.

NOTE: We use the term multivariable to refer to a regression with multiple explanatory variables. In this sense a multivariate model would be a regression with several outcomes - see this editorial for detail.

19.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,          # File import
  here,         # File locator
  tidyverse,    # data management + ggplot2 graphics, 
  stringr,      # manipulate text strings 
  purrr,        # loop over objects in a tidy way
  gtsummary,    # summary statistics and tests 
  broom,        # tidy up results from regressions
  lmtest,       # likelihood-ratio tests
  parameters,   # alternative to tidy up results from regressions
  see          # alternative to visualise forest plots
  )

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

Clean data

Store explanatory variables

We store the names of the explanatory columns as a character vector. This will be referenced later.

## define variables of interest 
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")

Convert to 1’s and 0’s

Below we convert the explanatory columns from “yes”/“no”, “m”/“f”, and “Death”/“Recover” to 1 / 0, to cooperate with the expectations of logistic regression models. To do this efficiently, we use across() from dplyr to transform multiple columns at one time. The function we apply to each column is case_when() (also dplyr), which applies logic to convert specified values to 1’s and 0’s. See the sections on across() and case_when() in the Cleaning data and core functions page.

Note: the “.” below represents the column that is being processed by across() at that moment.

## convert dichotomous variables to 0/1 
linelist <- linelist %>%  
  mutate(across(                                      
    .cols = all_of(c(explanatory_vars, "outcome")),  ## for each column listed and "outcome"
    .fns = ~case_when(                              
      . %in% c("m", "yes", "Death")   ~ 1,           ## recode male, yes and death to 1
      . %in% c("f", "no",  "Recover") ~ 0,           ## female, no and recover to 0
      TRUE                            ~ NA_real_)    ## otherwise set to missing
    )
  )

Drop rows with missing values

To drop rows with missing values, you can use the tidyr function drop_na(). However, we only want to do this for rows that are missing values in the columns of interest.

The first thing we must do is make sure our explanatory_vars vector includes the column age_cat (age_cat would have produced an error in the previous case_when() operation, which was only for dichotomous variables). Then we pipe the linelist to drop_na() to remove any rows with missing values in the outcome column or any of the explanatory_vars columns.

Before running the code, the number of rows in the linelist is nrow(linelist).

## add in age_category to the explanatory vars 
explanatory_vars <- c(explanatory_vars, "age_cat")

## drop rows with missing information for variables of interest 
linelist <- linelist %>% 
  drop_na(any_of(c("outcome", explanatory_vars)))

The number of rows remaining in linelist is nrow(linelist).
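
The same idea of dropping rows only when specific columns are missing can be sketched in base R with complete.cases(). The data frame toy and its columns are made up for illustration; this is an alternative sketch, not the approach used in the rest of this page.

```r
# toy data frame (made-up values, for illustration only)
toy <- data.frame(
  outcome = c(1, 0, NA, 1, 0),
  fever   = c(1, NA, 0, 1, 0),
  notes   = c(NA, "a", "b", NA, "c")   # missing values here should not matter
)

cols_of_interest <- c("outcome", "fever")

nrow(toy)                                             # rows before

# keep rows that are complete in the columns of interest only
toy_complete <- toy[complete.cases(toy[, cols_of_interest]), ]

nrow(toy_complete)                                    # rows after
```

Rows with a missing notes value are kept, because that column is not in cols_of_interest.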

19.2 Univariate

Just like in the page on Descriptive tables, your use case will determine which R package you use. We present two options for doing univariate analysis:

  • Use functions available in base R to quickly print results to the console. Use the broom package to tidy up the outputs.
  • Use the gtsummary package to model and get publication-ready outputs

base R

Linear regression

The base R function lm() performs linear regression, assessing the relationship between numeric response and explanatory variables that are assumed to have a linear relationship.

Provide the equation as a formula, with the response and explanatory column names separated by a tilde ~. Also, specify the dataset to data =. Define the model results as an R object, to use later.

lm_results <- lm(ht_cm ~ age, data = linelist)

You can then run summary() on the model results to see the coefficients (Estimates), P-value, residuals, and other measures.

summary(lm_results)
## 
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.579  -15.854    1.177   15.887  175.483 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.9051     0.5979   116.9   <2e-16 ***
## age           3.4354     0.0293   117.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared:  0.7675, Adjusted R-squared:  0.7674 
## F-statistic: 1.375e+04 on 1 and 4165 DF,  p-value: < 2.2e-16

Alternatively you can use the tidy() function from the broom package to pull the results into a table. The results tell us that for each one-year increase in age, height increases by about 3.4 cm, and that this is statistically significant.

tidy(lm_results)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    69.9     0.598       117.       0
## 2 age             3.44    0.0293      117.       0
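
If broom is not available, a comparable coefficient table can be assembled in base R. The sketch below fits lm() to simulated ages and heights (the data frame sim is made up) and combines the coefficient matrix with 95% confidence intervals.

```r
set.seed(2021)
# simulated ages and heights (not from the linelist)
sim <- data.frame(age = runif(300, 0, 18))
sim$ht_cm <- 70 + 5 * sim$age + rnorm(300, sd = 8)

fit <- lm(ht_cm ~ age, data = sim)

coef_table <- coef(summary(fit))   # matrix: Estimate, Std. Error, t value, Pr(>|t|)
ci <- confint(fit, level = 0.95)   # 95% confidence intervals

cbind(coef_table, ci)
```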

You can also add this regression to a ggplot. To do this, we first pull the points for the observed data and the fitted line into one data frame using the augment() function from broom.

## pull the regression points and observed data in to one dataset
points <- augment(lm_results)

## plot the data using age as the x-axis 
ggplot(points, aes(x = age)) + 
  ## add points for height 
  geom_point(aes(y = ht_cm)) + 
  ## add your regression line 
  geom_line(aes(y = .fitted), colour = "red")

It is also possible to add a simple linear regression directly in ggplot using the geom_smooth() function.

## add your data to a plot 
 ggplot(linelist, aes(x = age, y = ht_cm)) + 
  ## show points
  geom_point() + 
  ## add a linear regression 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

See the Resource section at the end of this chapter for more detailed tutorials.

Logistic regression

The function glm() from the stats package (part of base R) is used to fit Generalized Linear Models (GLM).

glm() can be used for univariate and multivariable logistic regression (e.g. to get Odds Ratios). Here are the core parts:

# arguments for glm()
glm(formula, family, data, weights, subset, ...)
  • formula = The model is provided to glm() as an equation, with the outcome on the left and explanatory variables on the right of a tilde ~.
  • family = This determines the type of model to run. For logistic regression, use family = "binomial", for poisson use family = "poisson". Other examples are in the table below.
  • data = Specify your data frame

If necessary, you can also specify the link function via the syntax family = familytype(link = "linkfunction"). You can read more in the documentation about other families and optional arguments such as weights = and subset = (?glm).

Family               Default link function
"binomial"           (link = "logit")
"gaussian"           (link = "identity")
"Gamma"              (link = "inverse")
"inverse.gaussian"   (link = "1/mu^2")
"poisson"            (link = "log")
"quasi"              (link = "identity", variance = "constant")
"quasibinomial"      (link = "logit")
"quasipoisson"       (link = "log")

When running glm() it is most common to save the results as a named R object. Then you can print the results to your console using summary() as shown below, or perform other operations on the results (e.g. exponentiate).
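
To illustrate the family and link options, the sketch below fits the same simulated data (x and y are made up) three ways: with the family given as a string, with the logit link written out, and with a probit link. The probit model is included only to show the syntax for a non-default link.

```r
set.seed(10)
# simulated binary outcome and one numeric predictor (not from the linelist)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-0.5 + 0.8 * x))

fit_default <- glm(y ~ x, family = "binomial")                 # logit link by default
fit_logit   <- glm(y ~ x, family = binomial(link = "logit"))   # same model, link made explicit
fit_probit  <- glm(y ~ x, family = binomial(link = "probit"))  # a different link function

coef(fit_default)
coef(fit_probit)     # coefficients are on a different scale

family(fit_probit)$link
```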

If you need to run a negative binomial regression you can use the MASS package; its glm.nb() function uses the same syntax as glm(). For a walk-through of different regressions, see the UCLA stats page.

Univariate glm()

In this example we are assessing the association between different age categories and the outcome of death (coded as 1 in the Preparation section). Below is a univariate model of outcome by age_cat. We save the model output as model and then print it with summary() to the console. Note the estimates provided are the log odds and that the baseline level is the first factor level of age_cat (“0-4”).

model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)
## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.233738   0.072805   3.210  0.00133 **
## age_cat5-9   -0.062898   0.101733  -0.618  0.53640   
## age_cat10-14  0.138204   0.107186   1.289  0.19726   
## age_cat15-19 -0.005565   0.113343  -0.049  0.96084   
## age_cat20-29  0.027511   0.102133   0.269  0.78765   
## age_cat30-49  0.063764   0.113771   0.560  0.57517   
## age_cat50-69 -0.387889   0.259240  -1.496  0.13459   
## age_cat70+   -0.639203   0.915770  -0.698  0.48518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4

To alter the baseline level of a given variable, ensure the column is class Factor and move the desired level to the first position with fct_relevel() (see page on Factors). For example, below we take column age_cat and set “20-29” as the baseline before piping the modified data frame into glm().

linelist %>% 
  mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>% 
  glm(formula = outcome ~ age_cat, family = "binomial") %>% 
  summary()
## 
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = .)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.339  -1.278   1.024   1.080   1.354  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.26125    0.07163   3.647 0.000265 ***
## age_cat0-4   -0.02751    0.10213  -0.269 0.787652    
## age_cat5-9   -0.09041    0.10090  -0.896 0.370220    
## age_cat10-14  0.11069    0.10639   1.040 0.298133    
## age_cat15-19 -0.03308    0.11259  -0.294 0.768934    
## age_cat30-49  0.03625    0.11302   0.321 0.748390    
## age_cat50-69 -0.41540    0.25891  -1.604 0.108625    
## age_cat70+   -0.66671    0.91568  -0.728 0.466546    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5705.1  on 4159  degrees of freedom
## AIC: 5721.1
## 
## Number of Fisher Scoring iterations: 4
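
Since the estimates above are on the log-odds scale, a common next step is to exponentiate them. The sketch below does this in base R on simulated data (the data frame sim and its columns are made up), using Wald confidence intervals from confint.default() as a simple assumption, and shows relevel() as a base R way to change the reference level.

```r
set.seed(99)
# simulated data: binary outcome and a three-level factor (not from the linelist)
sim <- data.frame(
  grp = factor(sample(c("a", "b", "c"), 600, replace = TRUE))
)
sim$outcome <- rbinom(600, size = 1,
                      prob = c(a = 0.3, b = 0.5, c = 0.6)[as.character(sim$grp)])

fit <- glm(outcome ~ grp, family = "binomial", data = sim)

# exponentiate log odds to odds ratios, with Wald 95% confidence intervals
# (the "(Intercept)" row is the baseline odds, not an odds ratio)
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(or_table, 2)

# relevel() is a base R way to change the reference level
sim$grp_b <- relevel(sim$grp, ref = "b")
fit_b <- glm(outcome ~ grp_b, family = "binomial", data = sim)
exp(coef(fit_b))
```

Changing the reference level changes the individual estimates but not the overall fit, as the identical deviance values in the two outputs above also show.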

Printing results

For most uses, several modifications must be made to the above outputs. The function tidy() from the package broom is convenient for making the model results presentable.

Here we demonstrate how to combine model outputs with a table of counts.

  1. Get the odds ratios (exponentiated log odds estimates) and confidence intervals by passing the model to tidy() and setting exponentiate = TRUE and conf.int = TRUE.
model <- glm(outcome ~ age_cat, family = "binomial", data = linelist) %>% 
  tidy(exponentiate = TRUE, conf.int = TRUE) %>%        # exponentiate and produce CIs
  mutate(across(where(is.numeric), ~ round(.x, digits = 2)))  # round all numeric columns

Below is the outputted tibble model:

  2. Combine these model results with a table of counts. Below, we create a cross-table of counts with the tabyl() function from janitor, as covered in the Descriptive tables page.
counts_table <- linelist %>% 
  janitor::tabyl(age_cat, outcome)

Here is what this counts_table data frame looks like:

Now we can bind the counts_table and the model results together horizontally with bind_cols() (dplyr). Remember that with bind_cols() the rows in the two data frames must be aligned perfectly. In this code, because we are binding within a pipe chain, we use . to represent the piped object counts_table as we bind it to model. To finish the process, we use select() to pick the desired columns and their order, and finally apply the base R round() function across all numeric columns to specify 2 decimal places.

combined <- counts_table %>%           # begin with table of counts
  bind_cols(., model) %>%              # combine with the outputs of the regression 
  select(term, 2:3, estimate,          # select and re-order cols
         conf.low, conf.high, p.value) %>% 
  mutate(across(where(is.numeric), round, digits = 2)) ## round to 2 decimal places
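Because binding columns relies purely on row position, it can help to check the alignment before binding. The sketch below uses base R and simulated data (sim, age_group and the other names are hypothetical) to show why the two tables have the same number of rows: the counts table has one row per factor level, and the model has one coefficient per level, with the intercept standing in for the reference level.

```r
# Simulated data (hypothetical): a 3-level age group and a binary outcome
set.seed(2)
sim <- data.frame(
  age_group = factor(
    sample(c("0-4", "5-9", "10+"), 300, replace = TRUE),
    levels = c("0-4", "5-9", "10+")),
  outcome = rbinom(300, 1, 0.4)
)

# Counts: one row per level of age_group, one column per outcome value
counts <- as.data.frame.matrix(table(sim$age_group, sim$outcome))

# Model: one row per coefficient (intercept = reference level "0-4")
fit <- glm(outcome ~ age_group, family = "binomial", data = sim)
estimates <- data.frame(
  term     = names(coef(fit)),
  estimate = exp(coef(fit)),
  row.names = NULL
)

# Check that the rows line up before binding side by side
stopifnot(nrow(counts) == nrow(estimates))
aligned <- cbind(level = rownames(counts), counts, estimates)
aligned
```

If the check fails (for example because a level has no observations in the model data), joining on a shared key column is safer than binding by position.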

Here is what the combined data frame looks like, printed nicely as an image with a function from flextable. The Tables for presentation page explains how to customize such tables with flextable, or you can use numerous other packages such as knitr or gt.

combined <- combined %>% 
  flextable::qflextable()

Looping multiple univariate models

Below we present a method using glm() and tidy(). For a simpler approach, see the section on gtsummary.

To run the models on several exposure variables to produce univariate odds ratios (i.e. not controlling for each other), you can use the approach below. It uses str_c() from stringr to create univariate formulas (see Characters and strings), runs the glm() regression on each formula, passes each glm() output to tidy() and finally collapses all the model outputs together with bind_rows() from dplyr. This approach uses map() from the package purrr to iterate - see the page on Iteration, loops, and lists for more information on this tool.

  1. Create a vector of column names of the explanatory variables. We already have this as explanatory_vars from the Preparation section of this page.

  2. Use str_c() to create multiple string formulas, with outcome on the left, and a column name from explanatory_vars on the right. The period . substitutes for the column name in explanatory_vars.

explanatory_vars %>% str_c("outcome ~ ", .)
## [1] "outcome ~ gender"  "outcome ~ fever"   "outcome ~ chills"  "outcome ~ cough"   "outcome ~ aches"   "outcome ~ vomit"   "outcome ~ age_cat"
  3. Pass these string formulas to map() and set ~glm() as the function to apply to each input. Within glm(), set the regression formula as as.formula(.x) where .x will be replaced by the string formula defined in the step above. map() will loop over each of the string formulas, running regressions for each one.

  4. The outputs of this first map() are passed to a second map() command, which applies tidy() to the regression outputs.

  5. Finally the output of the second map() (a list of tidied data frames) is condensed with bind_rows(), resulting in one data frame with all the univariate results.

models <- explanatory_vars %>%       # begin with variables of interest
  str_c("outcome ~ ", .) %>%         # combine each variable into formula ("outcome ~ variable of interest")
  
  # iterate through each univariate formula
  map(                               
    .f = ~glm(                       # pass the formulas one-by-one to glm()
      formula = as.formula(.x),      # within glm(), the string formula is .x
      family = "binomial",           # specify type of glm (logistic)
      data = linelist)) %>%          # dataset
  
  # tidy up each of the glm regression outputs from above
  map(
    .f = ~tidy(
      .x, 
      exponentiate = TRUE,           # exponentiate 
      conf.int = TRUE)) %>%          # return confidence intervals
  
  # collapse the list of regression outputs in to one data frame
  bind_rows() %>% 
  
  # round all numeric columns
  mutate(across(where(is.numeric), round, digits = 2))

This time, the end object models is longer because it now represents the combined results of several univariate regressions. Click through to see all the rows of models.

As before, we can create a counts table from the linelist for each explanatory variable, bind it to models, and make a nice table. We begin with the variables, and iterate through them with map(). We iterate through a user-defined function which involves creating a counts table with dplyr functions. Then the results are combined and bound with the models results.

## for each explanatory variable
univ_tab_base <- explanatory_vars %>% 
  map(.f = 
    ~{linelist %>%                ## begin with linelist
        group_by(outcome) %>%     ## group data set by outcome
        count(.data[[.x]]) %>%    ## produce counts for variable of interest
        pivot_wider(              ## spread to wide format (as in cross-tabulation)
          names_from = outcome,
          values_from = n) %>% 
        drop_na(.data[[.x]]) %>%         ## drop rows with missings
        rename("variable" = .x) %>%      ## change variable of interest column to "variable"
        mutate(variable = as.character(variable))} ## convert to character, else non-dichotomous (categorical) variables come out as factor and cant be merged
      ) %>% 
  
  ## collapse the list of count outputs in to one data frame
  bind_rows() %>% 
  
  ## merge with the outputs of the regression 
  bind_cols(., models) %>% 
  
  ## only keep columns interested in 
  select(term, 2:3, estimate, conf.low, conf.high, p.value) %>% 
  
  ## round decimal places
  mutate(across(where(is.numeric), round, digits = 2))

Below is what the data frame looks like. See the page on Tables for presentation for ideas on how to further convert this table to pretty HTML output (e.g. with flextable).

gtsummary package

Below we present the use of tbl_uvregression() from the gtsummary package. Just like in the page on Descriptive tables, gtsummary functions do a good job of running statistics and producing professional-looking outputs. This function produces a table of univariate regression results.

We select only the necessary columns from the linelist (explanatory variables and the outcome variable) and pipe them into tbl_uvregression(). We are going to run univariate regression on each of the columns we defined as explanatory_vars in the data Preparation section (gender, fever, chills, cough, aches, vomit, and age_cat).

Within the function itself, we provide the method = as glm (no quotes), the y = outcome column (outcome), specify to method.args = that we want to run logistic regression via family = binomial, and we tell it to exponentiate the results.

The output is HTML and contains the counts.

univ_tab <- linelist %>% 
  dplyr::select(all_of(explanatory_vars), outcome) %>% ## select variables of interest (all_of() for a character vector of names)

  tbl_uvregression(                         ## produce univariate table
    method = glm,                           ## define regression want to run (generalised linear model)
    y = outcome,                            ## define outcome variable
    method.args = list(family = binomial),  ## define what type of glm want to run (logistic)
    exponentiate = TRUE                     ## exponentiate to produce odds ratios (rather than log odds)
  )

## view univariate results table 
univ_tab
Characteristic N OR1 95% CI1 p-value
gender 4167 1.00 0.88, 1.13 >0.9
fever 4167 1.00 0.85, 1.17 >0.9
chills 4167 1.03 0.89, 1.21 0.7
cough 4167 1.15 0.97, 1.37 0.11
aches 4167 0.93 0.76, 1.14 0.5
vomit 4167 1.09 0.96, 1.23 0.2
age_cat 4167
0-4
5-9 0.94 0.77, 1.15 0.5
10-14 1.15 0.93, 1.42 0.2
15-19 0.99 0.80, 1.24 >0.9
20-29 1.03 0.84, 1.26 0.8
30-49 1.07 0.85, 1.33 0.6
50-69 0.68 0.41, 1.13 0.13
70+ 0.53 0.07, 3.20 0.5

1 OR = Odds Ratio, CI = Confidence Interval

There are many modifications you can make to this table output, such as adjusting the text labels, bolding rows by their p-value, etc. See tutorials here and elsewhere online.

19.3 Stratified

Stratified analysis is still being worked on for gtsummary; this page will be updated in due course.

19.4 Multivariable

For multivariable analysis, we again present two approaches:

  • glm() and tidy()
  • gtsummary package

The workflow is similar for each and only the last step of pulling together a final table is different.

Conduct multivariable

Here we use glm() but add more variables to the right side of the equation, separated by plus symbols (+).

To run the model with all of our explanatory variables we would run:

mv_reg <- glm(outcome ~ gender + fever + chills + cough + aches + vomit + age_cat, family = "binomial", data = linelist)

summary(mv_reg)
## 
## Call:
## glm(formula = outcome ~ gender + fever + chills + cough + aches + 
##     vomit + age_cat, family = "binomial", data = linelist)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.383  -1.279   1.029   1.078   1.346  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.069054   0.131726   0.524    0.600
## gender        0.002448   0.065133   0.038    0.970
## fever         0.004309   0.080522   0.054    0.957
## chills        0.034112   0.078924   0.432    0.666
## cough         0.138584   0.089909   1.541    0.123
## aches        -0.070705   0.104078  -0.679    0.497
## vomit         0.086098   0.062618   1.375    0.169
## age_cat5-9   -0.063562   0.101851  -0.624    0.533
## age_cat10-14  0.136372   0.107275   1.271    0.204
## age_cat15-19 -0.011074   0.113640  -0.097    0.922
## age_cat20-29  0.026552   0.102780   0.258    0.796
## age_cat30-49  0.059569   0.116402   0.512    0.609
## age_cat50-69 -0.388964   0.262384  -1.482    0.138
## age_cat70+   -0.647443   0.917375  -0.706    0.480
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5712.4  on 4166  degrees of freedom
## Residual deviance: 5700.2  on 4153  degrees of freedom
## AIC: 5728.2
## 
## Number of Fisher Scoring iterations: 4

If you want to include two variables and an interaction between them you can separate them with an asterisk * instead of a +. Separate them with a colon : if you are only specifying the interaction. For example:

glm(outcome ~ gender + age_cat * fever, family = "binomial", data = linelist)
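As a quick check of how the asterisk shorthand expands, the base R sketch below compares the terms of two formulas without fitting any model. The column names are the ones used in the example above; no data are needed because terms() works on the formula alone.

```r
# `age_cat * fever` is shorthand for both main effects plus their interaction
f_star  <- outcome ~ gender + age_cat * fever
f_colon <- outcome ~ gender + age_cat + fever + age_cat:fever

labels_star  <- attr(terms(f_star), "term.labels")
labels_colon <- attr(terms(f_colon), "term.labels")

labels_star                           # the terms R will actually fit
identical(labels_star, labels_colon)  # both formulas describe the same model
```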

Optionally, you can use this code to leverage the pre-defined vector of column names and re-create the above command using str_c(). This might be useful if your explanatory variable names are changing, or you don’t want to type them all out again.

## run a regression with all variables of interest 
mv_reg <- explanatory_vars %>%  ## begin with vector of explanatory column names
  str_c(collapse = "+") %>%     ## combine all names of the variables of interest separated by a plus
  str_c("outcome ~ ", .) %>%    ## combine the names of variables of interest with outcome in formula style
  glm(family = "binomial",      ## define type of glm as logistic,
      data = linelist)          ## define your dataset
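A base R alternative for building the same formula is reformulate() from stats, sketched below. It returns a formula object rather than a string, so it can be passed to glm() directly. The vector is re-defined here only so that the example is self-contained.

```r
# Vector of explanatory column names (as defined in the Preparation section)
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit", "age_cat")

# Build: outcome ~ gender + fever + ... + age_cat
mv_formula <- reformulate(explanatory_vars, response = "outcome")
mv_formula

# It could then be used as, for example:
# glm(mv_formula, family = "binomial", data = linelist)
```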

Building the model

You can build your model step-by-step, saving various models that include certain explanatory variables. You can compare these models with likelihood-ratio tests using lrtest() from the package lmtest, as below:

NOTE: Using base anova(model1, model2, test = "Chisq") produces the same results.

model1 <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
model2 <- glm(outcome ~ age_cat + gender, family = "binomial", data = linelist)

lmtest::lrtest(model1, model2)
## Likelihood ratio test
## 
## Model 1: outcome ~ age_cat
## Model 2: outcome ~ age_cat + gender
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   8 -2852.6                     
## 2   9 -2852.6  1 0.0002     0.9883
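To show what the likelihood-ratio test computes, the sketch below reproduces it by hand in base R on simulated data (sim, age and sex are invented names, not the handbook linelist). The test statistic is twice the difference in log-likelihood between the nested models, compared against a chi-squared distribution with degrees of freedom equal to the number of added parameters.

```r
# Simulated data (hypothetical): outcome depends on age but not on sex
set.seed(3)
sim <- data.frame(age = rnorm(400, 30, 10), sex = rbinom(400, 1, 0.5))
sim$outcome <- rbinom(400, 1, plogis(-1 + 0.03 * sim$age))

m1 <- glm(outcome ~ age,       family = "binomial", data = sim)
m2 <- glm(outcome ~ age + sex, family = "binomial", data = sim)

# Likelihood-ratio test by hand
lr_stat <- as.numeric(2 * (logLik(m2) - logLik(m1)))
lr_df   <- attr(logLik(m2), "df") - attr(logLik(m1), "df")
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# The same test via base anova()
anova_tab <- anova(m1, m2, test = "Chisq")

c(by_hand = lr_stat, anova = anova_tab$Deviance[2])
c(by_hand = lr_p,    anova = anova_tab[["Pr(>Chi)"]][2])
```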

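Another option is to take the model object and apply the step() function from the stats package. Specify which variable selection direction you want to use when building the model. Note that forward selection can only add terms listed in the scope = argument; if you start from the full model without a scope, as below, there is nothing to add and the model is returned unchanged.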

## choose a model using forward selection based on AIC
## you can also do "backward" or "both" by adjusting the direction
final_mv_reg <- mv_reg %>%
  step(direction = "forward", trace = FALSE)
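The sketch below illustrates how the starting model and the scope = argument affect forward selection with step(). It uses simulated data with invented column names (x1, x2, x3), so treat it as an illustration of the mechanics rather than a recommended modelling strategy.

```r
# Simulated data (hypothetical): only x1 is truly associated with the outcome
set.seed(4)
sim <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
sim$outcome <- rbinom(300, 1, plogis(0.2 + 1.0 * sim$x1))

null_model <- glm(outcome ~ 1,            family = "binomial", data = sim)
full_model <- glm(outcome ~ x1 + x2 + x3, family = "binomial", data = sim)

# Forward selection from the full model with no scope: nothing can be added
forward_from_full <- step(full_model, direction = "forward", trace = 0)

# Forward selection from the null model, with the candidate terms given in scope
forward_from_null <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace = 0
)

formula(forward_from_full)  # unchanged: all three terms kept
formula(forward_from_null)  # terms added one at a time while AIC improves
```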

You can also turn off scientific notation in your R session, for clarity:

options(scipen=999)

As described in the section on univariate analysis, pass the model output to tidy() to exponentiate the log odds and CIs. Finally we round all numeric columns to two decimal places. Scroll through to see all the rows.

mv_tab_base <- final_mv_reg %>% 
  broom::tidy(exponentiate = TRUE, conf.int = TRUE) %>%  ## get a tidy dataframe of estimates 
  mutate(across(where(is.numeric), round, digits = 2))          ## round 

Here is what the resulting data frame looks like:

Combine univariate and multivariable

Combine with gtsummary

The gtsummary package provides the tbl_regression() function, which will take the outputs from a regression (glm() in this case) and produce a nice summary table.

## show results table of final regression 
mv_tab <- tbl_regression(final_mv_reg, exponentiate = TRUE)

Let’s see the table:

mv_tab
Characteristic OR1 95% CI1 p-value
gender 1.00 0.88, 1.14 >0.9
fever 1.00 0.86, 1.18 >0.9
chills 1.03 0.89, 1.21 0.7
cough 1.15 0.96, 1.37 0.12
aches 0.93 0.76, 1.14 0.5
vomit 1.09 0.96, 1.23 0.2
age_cat
0-4
5-9 0.94 0.77, 1.15 0.5
10-14 1.15 0.93, 1.41 0.2
15-19 0.99 0.79, 1.24 >0.9
20-29 1.03 0.84, 1.26 0.8
30-49 1.06 0.85, 1.33 0.6
50-69 0.68 0.40, 1.13 0.14
70+ 0.52 0.07, 3.19 0.5

1 OR = Odds Ratio, CI = Confidence Interval

You can also combine several different output tables produced by gtsummary with the tbl_merge() function. We now combine the multivariable results with the gtsummary univariate results that we created above:

## combine with univariate results 
tbl_merge(
  tbls = list(univ_tab, mv_tab),                          # combine
  tab_spanner = c("**Univariate**", "**Multivariable**")) # set header names
Characteristic Univariate Multivariable
N OR1 95% CI1 p-value OR1 95% CI1 p-value
gender 4167 1.00 0.88, 1.13 >0.9 1.00 0.88, 1.14 >0.9
fever 4167 1.00 0.85, 1.17 >0.9 1.00 0.86, 1.18 >0.9
chills 4167 1.03 0.89, 1.21 0.7 1.03 0.89, 1.21 0.7
cough 4167 1.15 0.97, 1.37 0.11 1.15 0.96, 1.37 0.12
aches 4167 0.93 0.76, 1.14 0.5 0.93 0.76, 1.14 0.5
vomit 4167 1.09 0.96, 1.23 0.2 1.09 0.96, 1.23 0.2
age_cat 4167
0-4
5-9 0.94 0.77, 1.15 0.5 0.94 0.77, 1.15 0.5
10-14 1.15 0.93, 1.42 0.2 1.15 0.93, 1.41 0.2
15-19 0.99 0.80, 1.24 >0.9 0.99 0.79, 1.24 >0.9
20-29 1.03 0.84, 1.26 0.8 1.03 0.84, 1.26 0.8
30-49 1.07 0.85, 1.33 0.6 1.06 0.85, 1.33 0.6
50-69 0.68 0.41, 1.13 0.13 0.68 0.40, 1.13 0.14
70+ 0.53 0.07, 3.20 0.5 0.52 0.07, 3.19 0.5

1 OR = Odds Ratio, CI = Confidence Interval

Combine with dplyr

An alternative way of combining the glm()/tidy() univariate and multivariable outputs is with the dplyr join functions.

  • Join the univariate results from earlier (univ_tab_base, which contains counts) with the tidied multivariable results mv_tab_base
  • Use select() to keep only the columns we want, specify their order, and re-name them
  • Use round() with two decimal places on all the columns that are class double
## combine univariate and multivariable tables 
left_join(univ_tab_base, mv_tab_base, by = "term") %>% 
  ## choose columns and rename them
  select( # new name =  old name
    "characteristic" = term, 
    "recovered"      = "0", 
    "dead"           = "1", 
    "univ_or"        = estimate.x, 
    "univ_ci_low"    = conf.low.x, 
    "univ_ci_high"   = conf.high.x,
    "univ_pval"      = p.value.x, 
    "mv_or"          = estimate.y, 
    "mv_ci_low"      = conf.low.y, 
    "mv_ci_high"     = conf.high.y,
    "mv_pval"        = p.value.y 
  ) %>% 
  mutate(across(where(is.double), round, 2))   
## # A tibble: 20 x 11
##    characteristic recovered  dead univ_or univ_ci_low univ_ci_high univ_pval mv_or  mv_ci_low mv_ci_high mv_pval
##    <chr>              <dbl> <dbl>   <dbl>       <dbl>        <dbl>     <dbl> <dbl>      <dbl>      <dbl>   <dbl>
##  1 (Intercept)          909  1168    1.28        1.18         1.4       0     1.07       0.83       1.39    0.6 
##  2 gender               916  1174    1           0.88         1.13      0.97  1          0.88       1.14    0.97
##  3 (Intercept)          340   436    1.28        1.11         1.48      0     1.07       0.83       1.39    0.6 
##  4 fever               1485  1906    1           0.85         1.17      0.99  1          0.86       1.18    0.96
##  5 (Intercept)         1472  1877    1.28        1.19         1.37      0     1.07       0.83       1.39    0.6 
##  6 chills               353   465    1.03        0.89         1.21      0.68  1.03       0.89       1.21    0.67
##  7 (Intercept)          272   309    1.14        0.97         1.34      0.13  1.07       0.83       1.39    0.6 
##  8 cough               1553  2033    1.15        0.97         1.37      0.11  1.15       0.96       1.37    0.12
##  9 (Intercept)         1636  2114    1.29        1.21         1.38      0     1.07       0.83       1.39    0.6 
## 10 aches                189   228    0.93        0.76         1.14      0.51  0.93       0.76       1.14    0.5 
## 11 (Intercept)          931  1144    1.23        1.13         1.34      0     1.07       0.83       1.39    0.6 
## 12 vomit                894  1198    1.09        0.96         1.23      0.17  1.09       0.96       1.23    0.17
## 13 (Intercept)          338   427    1.26        1.1          1.46      0     1.07       0.83       1.39    0.6 
## 14 age_cat5-9           365   433    0.94        0.77         1.15      0.54  0.94       0.77       1.15    0.53
## 15 age_cat10-14         273   396    1.15        0.93         1.42      0.2   1.15       0.93       1.41    0.2 
## 16 age_cat15-19         238   299    0.99        0.8          1.24      0.96  0.99       0.79       1.24    0.92
## 17 age_cat20-29         345   448    1.03        0.84         1.26      0.79  1.03       0.84       1.26    0.8 
## 18 age_cat30-49         228   307    1.07        0.85         1.33      0.58  1.06       0.85       1.33    0.61
## 19 age_cat50-69          35    30    0.68        0.41         1.13      0.13  0.68       0.4        1.13    0.14
## 20 age_cat70+             3     2    0.53        0.07         3.2       0.49  0.52       0.07       3.19    0.48

19.5 Forest plot

This section shows how to produce a plot with the outputs of your regression. There are two options: you can build a plot yourself using ggplot2, or use a meta-package called easystats (a package that includes many packages).

See the page on ggplot basics if you are unfamiliar with the ggplot2 plotting package.

ggplot2 package

You can build a forest plot with ggplot() by plotting elements of the multivariable regression results. Add the layers of the plots using these “geoms”:

  • estimates with geom_point()
  • confidence intervals with geom_errorbar()
  • a vertical line at OR = 1 with geom_vline()

Before plotting, you may want to use fct_relevel() from the forcats package to set the order of the variables/levels on the y-axis. ggplot() may display them in alpha-numeric order which would not work well for these age category values (“30” would appear before “5”). See the page on Factors for more details.

## remove the intercept term from your multivariable results
mv_tab_base %>% 
  
  #set order of levels to appear along y-axis
  mutate(term = fct_relevel(
    term,
    "vomit", "gender", "fever", "cough", "chills", "aches",
    "age_cat5-9", "age_cat10-14", "age_cat15-19", "age_cat20-29",
    "age_cat30-49", "age_cat50-69", "age_cat70+")) %>%
  
  # remove "intercept" row from plot
  filter(term != "(Intercept)") %>% 
  
  ## plot with variable on the y axis and estimate (OR) on the x axis
  ggplot(aes(x = estimate, y = term)) +
  
  ## show the estimate as a point
  geom_point() + 
  
  ## add in an error bar for the confidence intervals
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) + 
  
  ## show where OR = 1 is for reference as a dashed line
  geom_vline(xintercept = 1, linetype = "dashed")

easystats packages

An alternative, if you do not want the fine level of control that ggplot2 provides, is to use a combination of easystats packages.

The function model_parameters() from the parameters package does the equivalent of the broom package function tidy(). The see package then accepts those outputs and creates a default forest plot as a ggplot() object.

pacman::p_load(easystats)
## extract exponentiated estimates (odds ratios) and draw the default forest plot
final_mv_reg %>% 
  model_parameters(exponentiate = TRUE) %>% 
  plot()

19.6 Resources

The content of this page was informed by these resources and vignettes online:

Linear regression in R

gtsummary

UCLA stats page

sthda stepwise regression

20 Missing data

This page will cover how to:

  1. Assess missingness
  2. Filter out rows by missingness
  3. Plot missingness over time
  4. Handle how NA is displayed in plots
  5. Perform missing value imputation: MCAR, MAR, MNAR

20.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,           # import/export
  tidyverse,     # data mgmt and viz
  naniar,        # assess and visualize missingness
  mice           # missing data imputation
)

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

Convert missing on import

When importing your data, be aware of values that should be classified as missing. For example, 99, 999, “Missing”, blank cells (""), or cells with a single space (" "). You can convert these to NA (R’s version of missing data) during the data import command.
See the Missing data section of the Import and export page for details, as the exact syntax varies by file type.

20.2 Missing values in R

Below we explore ways that missingness is presented and assessed in R, along with some adjacent values and functions.

NA

In R, missing values are represented by a reserved (special) value - NA. Note that this is typed without quotes. “NA” is different and is just a normal character value (also a Beatles lyric from the song Hey Jude).

Your data may have other ways of representing missingness, such as “99”, or “Missing”, or “Unknown” - you may even have empty character value "" which looks “blank”, or a single space " ". Be aware of these and consider whether to convert them to NA during import or during data cleaning with na_if().

In your data cleaning, you may also want to convert the other way - changing all NA to “Missing” or similar with replace_na() or with fct_explicit_na() for factors.
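As a small illustration of converting in both directions, the sketch below uses na_if() from dplyr and replace_na() from tidyr on a made-up data frame. The column names and the sentinel values ("", "Missing", 99) are assumptions chosen for the example.

```r
library(dplyr)
library(tidyr)

# Made-up data in which missingness is recorded in several different ways
raw <- data.frame(
  hospital = c("Central", "", "Missing", "Port"),
  temp     = c(36.5, 99, 38.1, 99)
)

# Convert the placeholder values to NA
cleaned <- raw %>%
  mutate(
    hospital = na_if(hospital, ""),         # empty string -> NA
    hospital = na_if(hospital, "Missing"),  # "Missing"    -> NA
    temp     = na_if(temp, 99)              # 99           -> NA
  )

# Convert the other way: give NA an explicit label for display
relabelled <- cleaned %>%
  mutate(hospital = replace_na(hospital, "Missing"))

cleaned
relabelled
```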

Versions of NA

Most of the time, NA represents a missing value and everything works fine. However, in some circumstances you may encounter the need for variations of NA specific to an object class (character, numeric, etc). This will be rare, but you should be aware.
The typical scenario for this is when creating a new column with the dplyr function case_when(). As described in the Cleaning data and core functions page, this function evaluates every row in the data frame, assesses whether the row meets specified logical criteria (left side of the ~), and assigns the corresponding new value (right side of the ~). Importantly: all values on the right side must be the same class.

linelist <- linelist %>% 
  
  # Create new "age_years" column from "age" column
  mutate(age_years = case_when(
    age_unit == "years"  ~ age,       # if age is given in years, assign original value
    age_unit == "months" ~ age/12,    # if age is given in months, divide by 12
    is.na(age_unit)      ~ age,       # if age UNIT is missing, assume years
    TRUE                 ~ NA_real_)) # any other circumstance, assign missing

If you want NA on the right side, you may need to specify one of the special NA options listed below. If the other right side values are character, consider using “Missing” instead or otherwise use NA_character_. If they are all numeric, use NA_real_. If they are all dates or logical, you can use NA.

  • NA - use for dates or logical TRUE/FALSE
  • NA_character_ - use for characters
  • NA_real_ - use for numeric

Again, it is not likely you will encounter these variations unless you are using case_when() to create a new column. See the R documentation on NA for more information.
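The base R sketch below shows the underlying type of each NA variant, which is why the choice matters when every right-side value must share one class.

```r
# Each NA variant carries a different underlying type
na_types <- c(
  plain     = typeof(NA),             # logical
  real      = typeof(NA_real_),       # double (numeric)
  integer   = typeof(NA_integer_),    # integer
  character = typeof(NA_character_)   # character
)
na_types

# All of them are still recognised as missing
all(is.na(c(NA, NA_real_, NA_integer_)), is.na(NA_character_))
```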

NULL

NULL is another reserved value in R. It represents the absence of a value or object (it has length zero), and it is returned by expressions or functions whose values are undefined. Generally, do not assign NULL as a value unless you are writing functions or perhaps a shiny app that returns NULL in specific scenarios.

Null-ness can be assessed using is.null() and conversion can be made with as.null().

See this blog post on the difference between NULL and NA.

NaN

Impossible values are represented by the special value NaN. An example of this is when you force R to divide 0 by 0. You can assess this with is.nan(). You may also encounter complementary functions including is.infinite() and is.finite().
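The short base R sketch below shows how these test functions respond to NA, NaN and an infinite value. Note that is.na() is TRUE for NaN as well, whereas is.nan() is FALSE for NA.

```r
x <- c(1, NA, 0 / 0, 1 / 0)   # a number, NA, NaN, Inf

nan_flags      <- is.nan(x)       # TRUE only for NaN
na_flags       <- is.na(x)        # TRUE for both NA and NaN
finite_flags   <- is.finite(x)    # TRUE only for ordinary numbers
infinite_flags <- is.infinite(x)  # TRUE only for Inf / -Inf

rbind(nan_flags, na_flags, finite_flags, infinite_flags)
```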

Inf

Inf represents an infinite value, such as when you divide a number by 0.

As an example of how this might impact your work: let’s say you have a vector/column z that contains these values: z <- c(1, 22, NA, Inf, NaN, 5)

If you want to use max() on the column to find the highest value, you can use na.rm = TRUE to remove the NA (and NaN) from the calculation, but the Inf remains and Inf will be returned. To resolve this, you can use brackets [ ] and is.finite() to subset such that only finite values are used for the calculation: max(z[is.finite(z)]).

z <- c(1, 22, NA, Inf, NaN, 5)
max(z)                           # returns NA
max(z, na.rm=T)                  # returns Inf
max(z[is.finite(z)])             # returns 22

Examples

R command Outcome
5 / 0 Inf
0 / 0 NaN
5 / NA NA
5 / Inf 0
NA - 5 NA
Inf / 5 Inf
class(NA) "logical"
class(NaN) "numeric"
class(Inf) "numeric"
class(NULL) "NULL"

“NAs introduced by coercion” is a common warning message. This can happen if you attempt an impossible conversion, such as converting a character value that is not a number (e.g. "thirty") to numeric.

as.numeric(c("10", "20", "thirty", "40"))
## Warning: NAs introduced by coercion
## [1] 10 20 NA 40

NULL is ignored in a vector.

my_vector <- c(25, NA, 10, NULL)  # define
my_vector                         # print
## [1] 25 NA 10

Variance of one number results in NA.

var(22)
## [1] NA

20.3 Useful functions

The following are useful base R functions when assessing or handling missing values:

is.na() and !is.na()

Use is.na() to identify missing values, or use its opposite (with ! in front) to identify non-missing values. These both return logical values (TRUE or FALSE). Remember that you can sum() the resulting vector to count the number of TRUE values, e.g. sum(is.na(linelist$date_outcome)).

my_vector <- c(1, 4, 56, NA, 5, NA, 22)
is.na(my_vector)
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
!is.na(my_vector)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
sum(is.na(my_vector))
## [1] 2

na.omit()

This function, if applied to a data frame, will remove rows with any missing values. It is also from base R.
If applied to a vector, it will remove NA values from the vector it is applied to. For example:

na.omit(my_vector)
## [1]  1  4 56  5 22
## attr(,"na.action")
## [1] 4 6
## attr(,"class")
## [1] "omit"
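For the data frame case, here is a minimal base R sketch with a made-up data frame (the columns are invented for illustration). complete.cases() is shown alongside because it flags the rows that na.omit() would keep.

```r
# Made-up data frame with missing values in two different columns
df <- data.frame(
  id  = 1:4,
  age = c(10, NA, 35, 50),
  sex = c("f", "m", NA, "m")
)

complete_flags <- complete.cases(df)  # TRUE where a row has no NA at all
complete_rows  <- na.omit(df)         # drops rows 2 and 3

complete_flags
complete_rows
```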

drop_na()

This is a tidyr function that is useful in a data cleaning pipeline. If run with the parentheses empty, it removes rows with any missing values. If column names are specified in the parentheses, rows with missing values in those columns will be dropped. You can also use “tidyselect” syntax to specify the columns.

linelist %>% 
  drop_na(case_id, date_onset, age) # drops rows missing values for any of these columns

na.rm = TRUE

When you run a mathematical function such as max(), min(), sum() or mean(), if there are any NA values present the returned value will be NA. This default behavior is intentional, so that you are alerted if any of your data are missing.

You can avoid this by removing missing values from the calculation. To do this, include the argument na.rm = TRUE (“na.rm” stands for “remove NA”).

my_vector <- c(1, 4, 56, NA, 5, NA, 22)

mean(my_vector)     
## [1] NA
mean(my_vector, na.rm = TRUE)
## [1] 17.6

20.4 Assess missingness in a data frame

You can use the package naniar to assess and visualize missingness in the data frame linelist.

# install and/or load package
pacman::p_load(naniar)

Quantifying missingness

To find the percent of all values that are missing use pct_miss(). Use n_miss() to get the number of missing values.

# percent of ALL data frame values that are missing
pct_miss(linelist)
## [1] 6.688745

The two functions below return the percent of rows with any missing value, or that are entirely complete, respectively. Remember that NA means missing, and that "" or " " will not be counted as missing.

# Percent of rows with any value missing
pct_miss_case(linelist)   # use n_case_miss() for counts
## [1] 69.12364
# Percent of rows that are complete (no values missing)  
pct_complete_case(linelist) # use n_case_complete() for counts
## [1] 30.87636
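To clarify what these percentages measure, the sketch below computes comparable quantities in base R on a made-up data frame. This illustrates the definitions only and is not how naniar is implemented.

```r
# Made-up data frame: 4 rows x 3 columns = 12 values, 3 of them missing
df <- data.frame(
  id  = 1:4,
  age = c(10, NA, 35, 50),
  sex = c("f", "m", NA, NA)
)

pct_values_missing <- mean(is.na(df)) * 100            # share of all cells that are NA
pct_rows_missing   <- mean(!complete.cases(df)) * 100  # share of rows with any NA
pct_rows_complete  <- mean(complete.cases(df)) * 100   # share of rows with no NA

c(values_missing = pct_values_missing,
  rows_missing   = pct_rows_missing,
  rows_complete  = pct_rows_complete)
```

The gap between the first two numbers mirrors the linelist results above: a small share of missing cells can still affect a large share of rows.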

Visualizing missingness

The gg_miss_var() function will show you the number (or %) of missing values in each column. A few nuances:

  • You can add a column name (not in quotes) to the argument facet = to see the plot by groups
  • By default, counts are shown instead of percents, change this with show_pct = TRUE
  • You can add axis and title labels as for a normal ggplot() with + labs(...)
gg_miss_var(linelist, show_pct = TRUE)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.

Here the data are piped %>% into the function. The facet = argument is also used to split the data.

linelist %>% 
  gg_miss_var(show_pct = TRUE, facet = outcome)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.

You can use vis_miss() to visualize the data frame as a heatmap, showing whether each value is missing or not. You can also select() certain columns from the data frame and provide only those columns to the function.

# Heatplot of missingness across the entire data frame  
vis_miss(linelist)

Explore and visualize missingness relationships

How do you visualize something that is not there??? By default, ggplot() removes points with missing values from plots.

naniar offers a solution via geom_miss_point(). When creating a scatterplot of two columns, records with one of the values missing and the other value present are shown by setting the missing values to 10% lower than the lowest value in the column, and coloring them distinctly.

In the scatterplot below, the red dots are records where the value for one column is present but the value for the other column is missing. This allows you to see the distribution of missing values in relation to the non-missing values.

ggplot(
  data = linelist,
  mapping = aes(x = age_years, y = temp)) +     
  geom_miss_point()

To assess missingness in the data frame stratified by another column, consider gg_miss_fct(), which returns a heatmap of percent missingness in the data frame by a factor/categorical (or date) column:

gg_miss_fct(linelist, age_cat5)

This function can also be used with a date column to see how missingness has changed over time:

gg_miss_fct(linelist, date_onset)
## Warning: Removed 29 rows containing missing values (geom_tile).

“Shadow” columns

Another way to visualize missingness in one column by values in a second column is using the “shadow” that naniar can create. bind_shadow() creates a binary NA/not NA column (with the values "NA" and "!NA") for every existing column, and binds all these new columns to the original dataset with the suffix "_NA". This doubles the number of columns - see below:

shadowed_linelist <- linelist %>% 
  bind_shadow()

names(shadowed_linelist)
##  [1] "case_id"                 "generation"              "date_infection"          "date_onset"              "date_hospitalisation"    "date_outcome"           
##  [7] "outcome"                 "gender"                  "age"                     "age_unit"                "age_years"               "age_cat"                
## [13] "age_cat5"                "hospital"                "lon"                     "lat"                     "infector"                "source"                 
## [19] "wt_kg"                   "ht_cm"                   "ct_blood"                "fever"                   "chills"                  "cough"                  
## [25] "aches"                   "vomit"                   "temp"                    "time_admission"          "bmi"                     "days_onset_hosp"        
## [31] "case_id_NA"              "generation_NA"           "date_infection_NA"       "date_onset_NA"           "date_hospitalisation_NA" "date_outcome_NA"        
## [37] "outcome_NA"              "gender_NA"               "age_NA"                  "age_unit_NA"             "age_years_NA"            "age_cat_NA"             
## [43] "age_cat5_NA"             "hospital_NA"             "lon_NA"                  "lat_NA"                  "infector_NA"             "source_NA"              
## [49] "wt_kg_NA"                "ht_cm_NA"                "ct_blood_NA"             "fever_NA"                "chills_NA"               "cough_NA"               
## [55] "aches_NA"                "vomit_NA"                "temp_NA"                 "time_admission_NA"       "bmi_NA"                  "days_onset_hosp_NA"

These “shadow” columns can be used to plot the proportion of values that are missing, by any other column.

For example, the plot below shows the proportion of records missing age_years (age in years), by that record’s value in date_hospitalisation. Essentially, you are plotting the density of the x-axis column, but stratifying the results (colour =) by a shadow column of interest. This analysis works best if the x-axis is a numeric or date column.

ggplot(data = shadowed_linelist,          # data frame with shadow columns
  mapping = aes(x = date_hospitalisation, # numeric or date column
                colour = age_years_NA)) + # shadow column of interest
  geom_density()                          # plots the density curves

You can also use these “shadow” columns to stratify a statistical summary, as shown below:

linelist %>%
  bind_shadow() %>%                # create the shadow cols
  group_by(date_outcome_NA) %>%    # shadow col for stratifying
  summarise(across(
    .cols = age_years,             # variable of interest for calculations
    .fns = list("mean" = mean,     # stats to calculate
                "sd" = sd,
                "var" = var,
                "min" = min,
                "max" = max),  
    na.rm = TRUE))                 # other arguments for the stat calculations
## # A tibble: 2 x 6
##   date_outcome_NA age_years_mean age_years_sd age_years_var age_years_min age_years_max
##   <fct>                    <dbl>        <dbl>         <dbl>         <dbl>         <dbl>
## 1 !NA                       16.0         12.6          158.             0            84
## 2 NA                        16.2         12.9          167.             0            69

An alternative way to plot the proportion of a column’s values that are missing over time is shown below. It does not involve naniar. This example shows the percent of weekly observations that are missing.

  1. Aggregate the data into a useful time unit (days, weeks, etc.), summarizing the proportion of observations with NA (and any other values of interest)
  2. Plot the proportion missing as a line using ggplot()

Below, we take the linelist, add a new column for week, group the data by week, and then calculate the percent of that week’s records where the value is missing. (note: if you want % of 7 days the calculation would be slightly different).

outcome_missing <- linelist %>%
  mutate(week = lubridate::floor_date(date_onset, "week")) %>%   # create new week column
  group_by(week) %>%                                             # group the rows by week
  summarise(                                                     # summarize each week
    n_obs = n(),                                                  # number of records
    
    outcome_missing = sum(is.na(outcome) | outcome == ""),        # number of records missing the value
    outcome_p_miss  = outcome_missing / n_obs,                    # proportion of records missing the value
  
    outcome_dead    = sum(outcome == "Death", na.rm=T),           # number of records as dead
    outcome_p_dead  = outcome_dead / n_obs) %>%                   # proportion of records as dead
  
  tidyr::pivot_longer(-week, names_to = "statistic") %>%         # pivot all columns except week, to long format for ggplot
  filter(stringr::str_detect(statistic, "_p_"))                  # keep only the proportion values

Then we plot the proportion missing as a line, by week. See the ggplot basics page if you are unfamiliar with the ggplot2 plotting package.

ggplot(data = outcome_missing) +
  geom_line(
    mapping = aes(x = week, y = value, group = statistic, color = statistic),
    size = 2,
    stat = "identity") +
  labs(
    title = "Weekly outcomes",
    x = "Week",
    y = "Proportion of weekly records") + 
  scale_color_discrete(
    name = "",
    labels = c("Died", "Missing outcome")) +
  scale_y_continuous(breaks = c(seq(0, 1, 0.1))) +
  theme_minimal() +
  theme(legend.position = "bottom")

20.5 Using data with missing values

Filter out rows with missing values

To quickly remove rows with missing values, use the tidyr function drop_na().

You can compare the results with the number of rows in the original linelist, given by nrow(linelist). The number of rows remaining after dropping is shown below:

linelist %>% 
  drop_na() %>%     # remove rows with ANY missing values
  nrow()
## [1] 1818

You can specify to drop rows with missingness in certain columns:

linelist %>% 
  drop_na(date_onset) %>% # remove rows missing date_onset 
  nrow()
## [1] 5632

You can list columns one after the other, or use “tidyselect” helper functions:

linelist %>% 
  drop_na(contains("date")) %>% # remove rows missing values in any "date" column 
  nrow()
## [1] 3029
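
To make the three variants of drop_na() easier to check by eye, here is a small sketch on a made-up data frame (toy and its values are hypothetical). It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 5 records with different patterns of missingness
toy <- tibble::tibble(
  case_id      = 1:5,
  date_onset   = as.Date(c("2014-05-01", NA, "2014-05-03", "2014-05-04", NA)),
  date_outcome = as.Date(c("2014-05-10", "2014-05-11", NA, "2014-05-12", NA)),
  age_years    = c(4, 31, 18, NA, 50))

n_complete   <- toy %>% drop_na() %>% nrow()                  # no NA in any column
n_with_onset <- toy %>% drop_na(date_onset) %>% nrow()        # date_onset present
n_with_dates <- toy %>% drop_na(contains("date")) %>% nrow()  # both date columns present

c(complete = n_complete, with_onset = n_with_onset, with_dates = n_with_dates)
```

Only record 1 is complete in every column, records 1, 3, and 4 have a date of onset, and records 1 and 4 have both dates, so the three counts are 1, 3, and 2.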

Handling NA in ggplot()

It is often wise to report the number of values excluded from a plot in a caption.

In ggplot(), you can add labs() and within it a caption =. In the caption, you can use str_glue() from the stringr package to paste values together into a sentence dynamically, so that they adjust to the data. An example is below:

  • Note the use of \n for a new line.
  • Note that if multiple columns contribute to values not being plotted (e.g. age or sex, if those are reflected in the plot), then you must filter on those columns as well to correctly calculate the number not shown.
labs(
  title = "",
  y = "",
  x = "",
  caption  = stringr::str_glue(
    "n = {nrow(central_data)} from Central Hospital;\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown."))  

Sometimes, it can be easier to save the string as an object in commands prior to the ggplot() command, and simply reference the named string object within the str_glue().
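
A minimal sketch of that approach follows. The central_data values, the object names n_total, n_missing, and plot_caption, and the bar plot itself are all hypothetical, chosen only for illustration. It assumes dplyr, ggplot2, and stringr are installed.

```r
library(dplyr)
library(ggplot2)
library(stringr)

# Hypothetical stand-in for the Central Hospital records
central_data <- tibble::tibble(
  date_onset = as.Date(c("2014-05-01", "2014-05-02", NA, "2014-05-02", NA)),
  hospital   = "Central Hospital")

# Save the numbers first, then build the caption string
n_total   <- nrow(central_data)
n_missing <- sum(is.na(central_data$date_onset))

plot_caption <- str_glue(
  "n = {n_total} from Central Hospital;\n{n_missing} cases missing date of onset and not shown.")

# Reference the saved string in the plot
onset_plot <- ggplot(
  data = central_data %>% filter(!is.na(date_onset)),
  mapping = aes(x = date_onset)) +
  geom_bar() +
  labs(caption = plot_caption)
```

Keeping the counts in named objects also makes it easy to reuse them in the title or in accompanying text.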

NA in factors

If your column of interest is a factor, use fct_explicit_na() from the forcats package to convert NA values to a character value. See more detail in the Factors page. By default, the new value is “(Missing)” but this can be adjusted via the na_level = argument.

pacman::p_load(forcats)   # load package

linelist <- linelist %>% 
  mutate(gender = fct_explicit_na(gender, na_level = "Missing"))

levels(linelist$gender)
## [1] "f"       "m"       "Missing"
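
In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favour of fct_na_value_to_level(), where the new level is given with level =. The sketch below assumes such a version of forcats is installed and uses a made-up factor rather than the linelist.

```r
library(forcats)

# Hypothetical toy factor with one missing value
gender <- factor(c("f", "m", NA, "f"))

# Convert NA into an explicit level called "Missing"
gender_explicit <- fct_na_value_to_level(gender, level = "Missing")

levels(gender_explicit)
```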

20.6 Imputation

Sometimes, when analyzing your data, it will be important to “fill in the gaps” and impute missing data. While you can always simply analyze a dataset after removing all missing values, this can cause problems in many ways. Here are two examples:

  1. By removing all observations with missing values or variables with a large amount of missing data, you might reduce your power or ability to do some types of analysis. For example, as we discovered earlier, only a small fraction of the observations in our linelist dataset have no missing data across all of our variables. If we removed the majority of our dataset we’d be losing a lot of information! And, most of our variables have some amount of missing data–for most analysis it’s probably not reasonable to drop every variable that has a lot of missing data either.

  2. Depending on why your data is missing, analysis of only non-missing data might lead to biased or misleading results. For example, as we learned earlier we are missing data for some patients about whether they’ve had some important symptoms like fever or cough. But, as one possibility, maybe that information wasn’t recorded for people that just obviously weren’t very sick. In that case, if we just removed these observations we’d be excluding some of the healthiest people in our dataset and that might really bias any results.

It’s important to think about why your data might be missing in addition to seeing how much is missing. Doing this can help you decide how important it might be to impute missing data, and also which method of imputing missing data might be best in your situation.

Types of missing data

Here are three general types of missing data:

  1. Missing Completely at Random (MCAR). This means that there is no relationship between the probability of data being missing and any of the other variables in your data. The probability of being missing is the same for all cases. This is a rare situation. But if you have strong reason to believe your data is MCAR, analyzing only non-missing data without imputing won’t bias your results (although you may lose some power).

  2. Missing at Random (MAR). This name is actually a bit misleading, as MAR means that your data is missing in a systematic, predictable way based on the other information you have. For example, maybe every observation in our dataset with a missing value for fever was actually not recorded because every patient with chills and aches was just assumed to have a fever, so their temperature was never taken. If true, we could easily predict that every missing observation with chills and aches has a fever as well, and use this information to impute our missing data. In practice, this is more of a spectrum. Maybe a patient with both chills and aches whose temperature was not taken was more likely to have a fever, but not always. This is still predictable even if it isn’t perfectly predictable. This is a common type of missing data.

  3. Missing not at Random (MNAR). Sometimes, this is also called Not Missing at Random (NMAR). This means that the probability of a value being missing cannot be predicted from the other information we have, but the value also isn’t missing randomly. In this situation data is missing for unknown reasons or for reasons you don’t have any information about. For example, in our dataset maybe information on age is missing because some very elderly patients either don’t know or refuse to say how old they are. In this situation, missing data on age is related to the value itself (and thus isn’t random) and isn’t predictable based on the other information we have. MNAR is complex, and often the best way of dealing with it is to try to collect more data or information about why the data is missing rather than attempt to impute it.

In general, imputing MCAR data is often fairly simple, while MNAR is very challenging if not impossible. Many of the common data imputation methods assume MAR.
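
To make the three mechanisms more concrete, the sketch below simulates them in base R. Everything here is invented for illustration: the relationship between age and temperature, the missingness probabilities, and all object names. It is a toy demonstration, not a description of the linelist.

```r
set.seed(2021)                                    # for reproducibility
n    <- 10000
age  <- runif(n, min = 0, max = 60)
temp <- 36.5 + 0.03 * age + rnorm(n, sd = 0.5)    # assumed: temp rises slightly with age

true_mean <- mean(temp)                           # mean if nothing were missing

# MCAR: every record has the same 30% chance of missing temp
miss_mcar <- runif(n) < 0.3
# MAR: older patients are more likely to be missing temp (age is always observed)
miss_mar  <- runif(n) < plogis((age - 30) / 5)
# MNAR: high temperatures are more likely to be missing (depends on temp itself)
miss_mnar <- runif(n) < plogis(4 * (temp - 37.4))

# Complete-case means under each mechanism
mean_mcar <- mean(temp[!miss_mcar])
mean_mar  <- mean(temp[!miss_mar])
mean_mnar <- mean(temp[!miss_mnar])

# Under MAR, a model using the observed predictor (age) can recover the mean
fit <- lm(temp ~ age, subset = !miss_mar)
mean_mar_adjusted <- mean(predict(fit, newdata = data.frame(age = age)))

round(c(true = true_mean, mcar = mean_mcar, mar = mean_mar,
        mnar = mean_mnar, mar_adjusted = mean_mar_adjusted), 2)
```

In this simulation the MCAR complete-case mean should stay close to the full-data mean, while the MAR and MNAR complete-case means drift lower. Adjusting for age brings the MAR estimate back towards the full-data mean, whereas no observed column can repair the MNAR case.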

Useful packages

Some useful packages for imputing missing data are Hmisc, missForest (which uses random forests to impute missing data), and mice (Multivariate Imputation by Chained Equations). For this section we’ll just use the mice package, which implements a variety of techniques. The maintainer of the mice package has published an online book about imputing missing data that goes into more detail here (https://stefvanbuuren.name/fimd/).

Here is the code to load the mice package:

pacman::p_load(mice)

Mean Imputation

Sometimes if you are doing a simple analysis or you have strong reason to think you can assume MCAR, you can simply set missing numerical values to the mean of that variable. Perhaps we can assume that missing temperature measurements in our dataset were either MCAR or were just normal values. Here is the code to create a new variable that replaces missing temperature values with the mean temperature value in our dataset. However, in many situations replacing data with the mean can lead to bias, so be careful.

linelist <- linelist %>%
  mutate(temp_replace_na_with_mean = replace_na(temp, mean(temp, na.rm = T)))

You could also do a similar process for replacing categorical data with a specific value. For our dataset, imagine you knew that all observations with a missing value for their outcome (which can be “Death” or “Recover”) were actually people that died (note: this is not actually true for this dataset):

linelist <- linelist %>%
  mutate(outcome_replace_na_with_death = replace_na(outcome, "Death"))
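
One reason for caution with mean imputation is that it leaves the mean unchanged but shrinks the spread of the variable. The sketch below shows this on a made-up vector of temperatures (toy and temp_summary are hypothetical names). It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 6 temperatures, 2 of them missing
toy <- tibble::tibble(temp = c(36.5, 37.0, NA, 38.5, NA, 39.0))

toy <- toy %>% 
  mutate(temp_mean_imputed = replace_na(temp, mean(temp, na.rm = TRUE)))

# Compare mean and standard deviation before and after imputation
temp_summary <- toy %>% 
  summarise(
    mean_observed = mean(temp, na.rm = TRUE),
    mean_imputed  = mean(temp_mean_imputed),
    sd_observed   = sd(temp, na.rm = TRUE),
    sd_imputed    = sd(temp_mean_imputed))

temp_summary
```

The two means are identical, but the standard deviation of the imputed column is smaller, so any analysis that relies on the variability of the data will be too confident.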

Regression imputation

A somewhat more advanced method is to use some sort of statistical model to predict what a missing value is likely to be and replace it with the predicted value. Here is an example of creating predicted values for all the observations where temperature is missing, but age and fever are not, using a simple linear regression with fever status and age in years as predictors. In practice you’d want to use a better model than this sort of simple approach.

simple_temperature_model_fit <- lm(temp ~ fever + age_years, data = linelist)

#using our simple temperature model to predict values just for the observations where temp is missing
predictions_for_missing_temps <- predict(simple_temperature_model_fit,
                                        newdata = linelist %>% filter(is.na(temp))) 

Or, using the same modeling approach through the mice package to create imputed values for the missing temperature observations:

model_dataset <- linelist %>%
  select(temp, fever, age_years)  

temp_imputed <- mice(model_dataset,
                            method = "norm.predict",
                            seed = 1,
                            m = 1,
                            print = F)
## Warning: Number of logged events: 1
temp_imputed_values <- temp_imputed$imp$temp
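
The imputed values above are stored separately from the data. To get a data frame with the gaps filled in, the mice package provides complete(). The sketch below is self-contained and uses a made-up data frame (toy, toy_imp, and toy_completed are hypothetical names). It assumes the mice package is installed.

```r
library(mice)

# Hypothetical toy data: temp is missing for 3 of 10 records
toy <- data.frame(
  temp      = c(36.6, 38.4, NA, 36.9, NA, 38.9, 37.1, 38.6, 36.8, NA),
  fever     = factor(c("no", "yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes")),
  age_years = c(5, 12, 30, 41, 18, 7, 25, 33, 52, 61))

# Single regression imputation, as in the example above
toy_imp <- mice(toy, method = "norm.predict", m = 1, seed = 1, printFlag = FALSE)

# Return the dataset with the missing temp values replaced by imputed ones
toy_completed <- mice::complete(toy_imp)

toy_completed
```

Writing mice::complete() explicitly avoids a clash with tidyr::complete(), which is a different function with the same name.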

This is the same type of approach used by some more advanced methods, like using the missForest package to replace missing data with predicted values. In that case, the prediction model is a random forest instead of a linear regression. You can use other types of models to do this as well. However, while this approach works well under MCAR, you should be a bit careful if you believe MAR or MNAR more accurately describes your situation. The quality of your imputation will depend on how good your prediction model is, and even with a very good model the variability of your imputed data may be underestimated.
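
For the lm() approach, the predictions still have to be written back into the data. One possible way to do this in base R is sketched below, using a made-up data frame (toy and the new column names are hypothetical). Observed temperatures are kept, and predictions are used only where temp is missing.

```r
# Hypothetical toy data: temp missing in rows 3, 5 and 7; fever also missing in row 7
toy <- data.frame(
  temp      = c(36.6, 38.4, NA, 36.9, NA, 38.9, NA, 37.0, 38.5),
  fever     = c("no", "yes", "yes", "no", "no", "yes", NA, "no", "yes"),
  age_years = c(5, 12, 30, 41, 18, 7, 25, 60, 45))

# Fit the simple model on the rows where all three columns are present
fit <- lm(temp ~ fever + age_years, data = toy)

# Predict for every row (the result is NA where a predictor is missing)
toy$temp_predicted <- predict(fit, newdata = toy)

# Keep the observed temp where available, otherwise use the prediction
toy$temp_imputed <- ifelse(is.na(toy$temp), toy$temp_predicted, toy$temp)

# Flag which values were imputed, so they can be identified later
toy$temp_was_imputed <- is.na(toy$temp) & !is.na(toy$temp_imputed)

toy
```

Row 7 stays missing because fever is also missing there, which illustrates that regression imputation can only fill a gap when the predictors themselves are observed.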

LOCF and BOCF

Last observation carried forward (LOCF) and baseline observation carried forward (BOCF) are imputation methods for time series/longitudinal data. The idea is to replace a missing value with an earlier observed value: LOCF uses the most recent observed value (when multiple values are missing in succession, the method searches back to the last observed value), while BOCF uses the baseline (first) observed value.

The fill() function from the tidyr package can be used for both LOCF and BOCF imputation (however, other packages such as Hmisc, zoo, and data.table also include methods for doing this). To show the fill() syntax we’ll make up a simple time series dataset containing the number of cases of a disease for each quarter of the years 2000 and 2001. However, the year value is missing for the quarters after Q1, so we’ll need to impute it. The fill() function is also demonstrated in the Pivoting data page.

#creating our simple dataset
disease <- tibble::tribble(
  ~quarter, ~year, ~cases,
  "Q1",    2000,    66013,
  "Q2",      NA,    69182,
  "Q3",      NA,    53175,
  "Q4",      NA,    21001,
  "Q1",    2001,    46036,
  "Q2",      NA,    58842,
  "Q3",      NA,    44568,
  "Q4",      NA,    50197)

#imputing the missing year values:
disease %>% fill(year)
## # A tibble: 8 x 3
##   quarter  year cases
##   <chr>   <dbl> <dbl>
## 1 Q1       2000 66013
## 2 Q2       2000 69182
## 3 Q3       2000 53175
## 4 Q4       2000 21001
## 5 Q1       2001 46036
## 6 Q2       2001 58842
## 7 Q3       2001 44568
## 8 Q4       2001 50197

Note: make sure your data are sorted correctly before using the fill() function. fill() defaults to filling “down” but you can also impute values in different directions by changing the .direction argument (filling “up” is sometimes called next observation carried backward). We can make a similar dataset where the year value is recorded only at the end of the year and missing for earlier quarters:

#creating our slightly different dataset
disease <- tibble::tribble(
  ~quarter, ~year, ~cases,
  "Q1",      NA,    66013,
  "Q2",      NA,    69182,
  "Q3",      NA,    53175,
  "Q4",    2000,    21001,
  "Q1",      NA,    46036,
  "Q2",      NA,    58842,
  "Q3",      NA,    44568,
  "Q4",    2001,    50197)

#imputing the missing year values in the "up" direction:
disease %>% fill(year, .direction = "up")
## # A tibble: 8 x 3
##   quarter  year cases
##   <chr>   <dbl> <dbl>
## 1 Q1       2000 66013
## 2 Q2       2000 69182
## 3 Q3       2000 53175
## 4 Q4       2000 21001
## 5 Q1       2001 46036
## 6 Q2       2001 58842
## 7 Q3       2001 44568
## 8 Q4       2001 50197

In this example, LOCF and BOCF are clearly the right things to do, but in more complicated situations it may be harder to decide if these methods are appropriate. For example, you may have missing laboratory values for a hospital patient after the first day. Sometimes, this can mean the lab values didn’t change…but it could also mean the patient recovered and their values would be very different after the first day! Use these methods with caution.
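
With one row per patient per day, a plain fill() could carry a value from one patient into the next. The sketch below shows one way to guard against that by sorting and grouping first. The labs_long data, the crp column, and all values are made up for illustration. It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical lab values for two patients, deliberately out of order
labs_long <- tibble::tribble(
  ~patient, ~day, ~crp,
  "B",      2,    40,
  "A",      1,    12,
  "A",      3,    NA,
  "B",      1,    NA,
  "A",      2,    NA,
  "B",      3,    NA)

labs_locf <- labs_long %>% 
  arrange(patient, day) %>%             # sort so "previous" means the previous day
  group_by(patient) %>%                 # never carry a value from one patient to the next
  fill(crp, .direction = "down") %>%    # LOCF within each patient
  ungroup()

labs_locf
```

Patient B has no value on day 1, so that cell stays missing instead of inheriting patient A's last value, which is what an ungrouped fill() would have done.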

Multiple Imputation

The online book we mentioned earlier by the author of the mice package (https://stefvanbuuren.name/fimd/) contains a detailed explanation of multiple imputation and why you’d want to use it. But, here is a basic explanation of the method:

When you do multiple imputation, you create multiple datasets with the missing values imputed to plausible data values (depending on your research data you might want to create more or fewer of these imputed datasets, but the mice package sets the default number to 5). The difference is that rather than a single, specific value, each imputed value is drawn from an estimated distribution (so it includes some randomness). As a result, each of these datasets will have slightly different imputed values (however, the non-missing data will be the same in each of these imputed datasets). You still use some sort of predictive model to do the imputation in each of these new datasets (mice has many options for prediction methods including Predictive Mean Matching, logistic regression, and random forest) but the mice package can take care of many of the modeling details.

Then, once you have created these new imputed datasets, you can apply whatever statistical model or analysis you were planning to do to each of them and pool the results of these models together. This works very well to reduce bias in both MCAR and many MAR settings and often results in more accurate standard error estimates.

Here is an example of applying the Multiple Imputation process to predict temperature in our linelist dataset using age and fever status (our simplified model_dataset from above):

# imputing missing values for all variables in our model_dataset, and creating 10 new imputed datasets
multiple_imputation <- mice(
  model_dataset,
  seed = 1,
  m = 10,
  print = FALSE) 
## Warning: Number of logged events: 1
model_fit <- with(multiple_imputation, lm(temp ~ age_years + fever))

base::summary(mice::pool(model_fit))
##          term     estimate    std.error     statistic        df   p.value
## 1 (Intercept) 3.703143e+01 0.0270863456 1367.16240465  26.83673 0.0000000
## 2   age_years 3.867829e-05 0.0006090202    0.06350905 171.44363 0.9494351
## 3    feveryes 1.978044e+00 0.0193587115  102.17849544 176.51325 0.0000000

Here we used the mice default method of imputation, which is Predictive Mean Matching. We then used these imputed datasets to separately estimate and then pool results from simple linear regressions on each of these datasets. There are many details we’ve glossed over and many settings you can adjust during the Multiple Imputation process while using the mice package. For example, you won’t always have numerical data and might need to use other imputation methods (you can still use the mice package for many other types of data and methods). But, for a more robust analysis when missing data is a significant concern, Multiple Imputation is a good solution that isn’t always much more work than doing a complete case analysis.
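
The pooling step combines the per-dataset results using what are known as Rubin's rules. The base R sketch below applies those rules by hand to a single coefficient, using made-up estimates and standard errors (none of these numbers come from the linelist), to show where the pooled estimate and its standard error come from.

```r
# Hypothetical estimates and standard errors of one coefficient
# from m = 5 imputed datasets
estimates  <- c(1.96, 2.01, 1.98, 2.03, 1.97)
std_errors <- c(0.020, 0.019, 0.021, 0.020, 0.020)
m <- length(estimates)

pooled_estimate  <- mean(estimates)      # average of the m estimates
within_variance  <- mean(std_errors^2)   # average sampling variance
between_variance <- var(estimates)       # variance of the estimates across imputations

# Total variance adds the between-imputation variance, inflated by (1 + 1/m)
total_variance <- within_variance + (1 + 1 / m) * between_variance
pooled_se      <- sqrt(total_variance)

c(estimate = pooled_estimate, se = pooled_se)
```

The pooled standard error is larger than the average within-dataset standard error because it also reflects the uncertainty about the imputed values themselves.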

20.7 Resources

Vignette on the naniar package

Gallery of missing value visualizations

Online book about multiple imputation in R by the maintainer of the mice package

21 Standardised rates

This page will show you two ways to standardize an outcome, such as hospitalizations or mortality, by characteristics such as age and sex.

  • Using dsr package
  • Using PHEindicatormethods package

We begin by extensively demonstrating the processes of data preparation/cleaning/joining, as this is common when combining population data from multiple countries, standard population data, deaths, etc.

21.1 Overview

There are two main ways to standardize: direct and indirect standardization. Let’s say we would like to standardize the mortality rate by age and sex for country A and country B, and compare the standardized rates between these countries.

  • For direct standardization, you will have to know the number of the at-risk population and the number of deaths for each stratum of age and sex, for country A and country B. One stratum in our example could be females between ages 15-44.
  • For indirect standardization, you only need to know the total number of deaths and the age and sex structure of each country; these are combined with the stratum-specific rates of a reference population. This option is therefore feasible if age- and sex-specific mortality rates or population numbers are not available. Indirect standardization is furthermore preferable in case of small numbers per stratum, as estimates in direct standardization would be influenced by substantial sampling variation.
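
As a rough numeric illustration of the two approaches, the base R sketch below uses a made-up country with only two strata. All counts, the standard population, and the reference rates are hypothetical. The indirect calculation is expressed as a standardized mortality ratio (observed deaths divided by expected deaths).

```r
# Hypothetical country A with two strata
deaths_A <- c(young = 10,    old = 200)
pop_A    <- c(young = 10000, old = 5000)

# Hypothetical reference ("standard") population and reference rates
std_pop   <- c(young = 70000, old = 30000)
ref_rates <- c(young = 0.002, old = 0.03)

# Direct: apply country A's stratum-specific rates to the standard population
rates_A     <- deaths_A / pop_A
direct_rate <- sum(rates_A * std_pop) / sum(std_pop)

# Indirect: apply the reference rates to country A's population structure,
# then compare the observed total deaths with the expected number
expected_deaths <- sum(ref_rates * pop_A)
smr             <- sum(deaths_A) / expected_deaths

direct_rate * 100000   # directly standardized rate per 100,000
smr                    # standardized mortality ratio
```

Note that the direct calculation needs the deaths in each stratum, while the indirect calculation uses only the total number of deaths together with the population structure.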

21.2 Preparation

To show how standardization is done, we will use fictitious population counts and death counts from country A and country B, by age (in 5 year categories) and sex (female, male). To make the datasets ready for use, we will perform the following preparation steps:

  1. Load packages
  2. Load datasets
  3. Join the population and death data from the two countries
  4. Pivot longer so there is one row per age-sex stratum
  5. Clean the reference population (world standard population) and join it to the country data

In your scenario, your data may come in a different format. Perhaps your data are by province, city, or other catchment area. You may have one row for each death and information on age and sex for each (or a significant proportion) of these deaths. In this case, see the pages on Grouping data, Pivoting data, and Descriptive tables to create a dataset with event and population counts per age-sex stratum.

We also need a reference population, the standard population. For the purposes of this exercise we will use the world_standard_population_by_sex. The World standard population is based on the populations of 46 countries and was developed in 1960. There are many “standard” populations - as one example, the website of NHS Scotland is quite informative on the European Standard Population, World Standard Population and Scotland Standard Population.

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
     rio,                 # import/export data
     here,                # locate files
     tidyverse,           # data management and visualization
     stringr,             # cleaning characters and strings
     frailtypack,         # needed for dsr, for frailty models
     dsr,                 # standardise rates
     PHEindicatormethods) # alternative for rate standardisation

CAUTION: If you have a newer version of R, the dsr package cannot be directly downloaded from CRAN. However, it is still available from the CRAN archive. You can install and use this one.

For non-Mac users:

packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
# Other solution that may work
require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="http://cran.us.r-project.org")

For Mac users:

require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="https://mac.R-project.org")

Load population data

See the Download handbook and data page for instructions on how to download all the example data in the handbook. You can import the Standardisation page data directly into R from our Github repository by running the following import() commands:

# import demographics for country A directly from Github
A_demo <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/country_demographics.csv")

# import deaths for country A directly from Github
A_deaths <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/deaths_countryA.csv")

# import demographics for country B directly from Github
B_demo <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/country_demographics_2.csv")

# import deaths for country B directly from Github
B_deaths <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/deaths_countryB.csv")

# import the world standard population (by sex) directly from Github
standard_pop_data <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/standardization/world_standard_population_by_sex.csv")

First we load the demographic data (counts of males and females by 5-year age category) for the two countries that we will be comparing, “Country A” and “Country B”.

# Country A
A_demo <- import("country_demographics.csv")
# Country B
B_demo <- import("country_demographics_2.csv")

Load death counts

Conveniently, we also have the counts of deaths during the time period of interest, by age and sex. Each country’s counts are in a separate file, shown below.

Deaths in Country A

Deaths in Country B

Clean populations and deaths

We need to join and transform these data in the following ways:

  • Combine country populations into one dataset and pivot “long” so that each age-sex stratum is one row
  • Combine country death counts into one dataset and pivot “long” so each age-sex stratum is one row
  • Join the deaths to the populations

First, we combine the country populations datasets, pivot longer, and do minor cleaning. See the page on Pivoting data for more detail.

pop_countries <- A_demo %>%  # begin with country A dataset
     bind_rows(B_demo) %>%        # bind rows, because cols are identically named
     pivot_longer(                       # pivot longer
          cols = c(m, f),                   # columns to combine into one
          names_to = "Sex",                 # name for new column containing the category ("m" or "f") 
          values_to = "Population") %>%     # name for new column containing the numeric values pivoted
     mutate(Sex = recode(Sex,            # re-code values for clarity
          "m" = "Male",
          "f" = "Female"))

The combined population data now contain both countries A and B.

And now we perform similar operations on the two deaths datasets.

deaths_countries <- A_deaths %>%    # begin with country A deaths dataset
     bind_rows(B_deaths) %>%        # bind rows with B dataset, because cols are identically named
     pivot_longer(                  # pivot longer
          cols = c(Male, Female),        # columns to combine into one
          names_to = "Sex",              # name for new column containing the category ("Male" or "Female") 
          values_to = "Deaths") %>%      # name for new column containing the numeric values pivoted
     rename(age_cat5 = AgeCat)      # rename for clarity

The deaths data now look like this, and contain data from both countries:

We now join the deaths and population data based on common columns Country, age_cat5, and Sex. This adds the column Deaths.

country_data <- pop_countries %>% 
     left_join(deaths_countries, by = c("Country", "age_cat5", "Sex"))
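
A left join silently leaves NA in Deaths for any stratum whose keys do not match exactly. One possible check is to run anti_join() with the same keys. The sketch below uses small made-up tables (pop_toy and deaths_toy are hypothetical), one of which has a deliberately mis-typed age category. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical population table
pop_toy <- tibble::tibble(
  Country    = c("A", "A", "B", "B"),
  age_cat5   = c("0-4", "5-9", "0-4", "5-9"),
  Sex        = "Male",
  Population = c(1000, 1200, 900, 1100))

# Hypothetical deaths table with one inconsistent age category ("5 - 9")
deaths_toy <- tibble::tibble(
  Country  = c("A", "A", "B", "B"),
  age_cat5 = c("0-4", "5-9", "0-4", "5 - 9"),
  Sex      = "Male",
  Deaths   = c(3, 1, 4, 2))

# Rows of the population table that find no partner in the deaths table
unmatched <- anti_join(pop_toy, deaths_toy, by = c("Country", "age_cat5", "Sex"))

unmatched
```

An empty result means every stratum found a match. Here one row is returned, pointing to the category that needs cleaning before the join.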

We can now classify Sex, age_cat5, and Country as factors and set the level order using fct_relevel() function from the forcats package, as described in the page on Factors. Note, classifying the factor levels doesn’t visibly change the data, but the arrange() command does sort it by Country, age category, and sex.

country_data <- country_data %>% 
  mutate(
    Country = fct_relevel(Country, "A", "B"),
      
    Sex = fct_relevel(Sex, "Male", "Female"),
        
    age_cat5 = fct_relevel(
      age_cat5,
      "0-4", "5-9", "10-14", "15-19",
      "20-24", "25-29",  "30-34", "35-39",
      "40-44", "45-49", "50-54", "55-59",
      "60-64", "65-69", "70-74",
      "75-79", "80-84", "85")) %>% 
          
  arrange(Country, age_cat5, Sex)

CAUTION: If you have few deaths per stratum, consider using 10-, or 15-year categories, instead of 5-year categories for age.

Load reference population

Lastly, for the direct standardisation, we import the reference population (world “standard population” by sex)

# Reference population
standard_pop_data <- import("world_standard_population_by_sex.csv")

Clean reference population

The age category values in the country_data and standard_pop_data data frames will need to be aligned.

Currently, the values of the column age_cat5 from the standard_pop_data data frame contain the word “years” and “plus”, while those of the country_data data frame do not. We will have to make the age category values match. We use str_replace_all() from the stringr package, as described in the page on Characters and strings, to replace these patterns with no space "".

Furthermore, the package dsr expects that in the standard population, the column containing counts will be called "pop". So we rename that column accordingly.

# Remove specific string from column values
standard_pop_clean <- standard_pop_data %>%
     mutate(
          age_cat5 = str_replace_all(age_cat5, "years", ""),   # remove "years"
          age_cat5 = str_replace_all(age_cat5, "plus", ""),    # remove "plus"
          age_cat5 = str_replace_all(age_cat5, " ", "")) %>%   # remove " " space
     
     rename(pop = WorldStandardPopulation)   # change col name to "pop", as this is expected by dsr package

CAUTION: If you try to use str_replace_all() to remove a plus symbol, it won’t work because “+” is a special character in regular expressions. “Escape” the specialness by putting two backslashes in front, as in str_replace_all(column, "\\+", "").
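
As a minimal sketch of the escaping described in the caution above, the example below removes a literal plus sign in two ways. The vector age_labels is hypothetical and is not part of the handbook data; fixed() is the stringr helper that treats a pattern as literal text instead of a regular expression.

```r
library(stringr)

# Hypothetical age labels that use a "+" symbol instead of the word "plus"
age_labels <- c("0-4 years", "80-84 years", "85+ years")

# Option 1: escape the plus sign so it is matched literally
cleaned_escape <- str_replace_all(age_labels, "\\+", "")

# Option 2: wrap the pattern in fixed() so no regular expression rules apply
cleaned_fixed <- str_replace_all(age_labels, fixed("+"), "")

cleaned_escape
cleaned_fixed
```

Both options should return the same labels with the plus sign removed.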

Create dataset with standard population

Finally, the package PHEindicatormethods, detailed below, expects the standard populations joined to the country event and population counts. So, we will create a dataset all_data for that purpose.

all_data <- left_join(country_data, standard_pop_clean, by=c("age_cat5", "Sex"))

This complete dataset looks like this:

21.3 dsr package

Below we demonstrate calculating and comparing directly standardized rates using the dsr package. Note that dsr handles only directly standardized rates, not indirectly standardized rates.

In the data Preparation section, we made separate datasets for country counts and standard population:

  1. the country_data object, which is a population table with the number of population and number of deaths per stratum per country
  2. the standard_pop_clean object, containing the number of population per stratum for our reference population, the World Standard Population

We will use these separate datasets for the dsr approach.

Standardized rates

Below, we calculate rates per country directly standardized for age and sex. We use the dsr() function.

Of note - dsr() expects one data frame for the country populations and event counts (deaths), and a separate data frame with the reference population. It also expects that in this reference population dataset the unit-time column name is “pop” (we assured this in the data Preparation section).

There are many arguments, as annotated in the code below. Notably, event = is set to the column Deaths, and the fu = (“follow-up”) is set to the Population column. We set the subgroups of comparison as the column Country and we standardize based on age_cat5 and Sex. These last two columns are not assigned a particular named argument. See ?dsr for details.

# Calculate rates per country directly standardized for age and sex
mortality_rate <- dsr::dsr(
     data = country_data,  # specify object containing number of deaths per stratum
     event = Deaths,       # column containing number of deaths per stratum 
     fu = Population,      # column containing number of population per stratum
     subgroup = Country,   # units we would like to compare
     age_cat5,             # other columns - rates will be standardized by these
     Sex,
     refdata = standard_pop_clean, # reference population data frame, with column called pop
     method = "gamma",      # method to calculate 95% CI
     sig = 0.95,            # confidence level for the interval
     mp = 100000,           # we want rates per 100,000 population
     decimals = 2)          # number of decimals


# Print output as nice-looking HTML table
knitr::kable(mortality_rate) # show mortality rate before and after direct standardization

| Subgroup | Numerator | Denominator | Crude Rate (per 100000) | 95% LCL (Crude) | 95% UCL (Crude) | Std Rate (per 100000) | 95% LCL (Std) | 95% UCL (Std) |
|---|---|---|---|---|---|---|---|---|
| A | 11344 | 86790567 | 13.07 | 12.83 | 13.31 | 23.57 | 23.08 | 24.06 |
| B | 9955 | 52898281 | 18.82 | 18.45 | 19.19 | 19.33 | 18.46 | 20.22 |

Above, we see that while country A had a lower crude mortality rate than country B, it has a higher standardized rate after direct age and sex standardization.
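
To make the arithmetic behind direct standardization concrete, here is a small base R sketch. The data frame toy and all its numbers are hypothetical and chosen only so the calculation is easy to follow; this is not the dsr implementation, and it does not compute confidence intervals.

```r
# Hypothetical toy data: one country, two age strata
toy <- data.frame(
  age_group  = c("young", "old"),
  deaths     = c(10, 90),
  population = c(10000, 5000),
  std_pop    = c(70000, 30000)   # hypothetical standard population per stratum
)

# Stratum-specific mortality rates
toy$rate <- toy$deaths / toy$population

# Crude rate per 100,000: all deaths divided by all population
crude_rate <- sum(toy$deaths) / sum(toy$population) * 100000

# Directly standardized rate per 100,000:
# average of the stratum rates, weighted by the standard population
std_rate <- sum(toy$rate * toy$std_pop) / sum(toy$std_pop) * 100000

crude_rate
std_rate
```

In this toy example the standardized rate is lower than the crude rate because the standard population gives less weight to the older, higher-mortality stratum than the country's own population does.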

Standardized rate ratios

# Calculate RR
mortality_rr <- dsr::dsrr(
     data = country_data, # specify object containing number of deaths per stratum
     event = Deaths,      # column containing number of deaths per stratum 
     fu = Population,     # column containing number of population per stratum
     subgroup = Country,  # units we would like to compare
     age_cat5,
     Sex,                 # characteristics to which we would like to standardize 
     refdata = standard_pop_clean, # reference population, with numbers in column called pop
     refgroup = "B",      # reference for comparison
     estimate = "ratio",  # type of estimate
     sig = 0.95,          # confidence level for the interval
     mp = 100000,         # we want rates per 100,000 population
     decimals = 2)        # number of decimals

# Print table
knitr::kable(mortality_rr) 

| Comparator | Reference | Std Rate (per 100000) | Rate Ratio (RR) | 95% LCL (RR) | 95% UCL (RR) |
|---|---|---|---|---|---|
| A | B | 23.57 | 1.22 | 1.17 | 1.27 |
| B | B | 19.33 | 1.00 | 0.94 | 1.06 |

The standardized mortality rate in country A is 1.22 times that of country B (95% CI 1.17-1.27).

Standardized rate difference

# Calculate RD
mortality_rd <- dsr::dsrr(
     data = country_data,       # specify object containing number of deaths per stratum
     event = Deaths,            # column containing number of deaths per stratum 
     fu = Population,           # column containing number of population per stratum
     subgroup = Country,        # units we would like to compare
     age_cat5,                  # characteristics to which we would like to standardize
     Sex,                        
     refdata = standard_pop_clean, # reference population, with numbers in column called pop
     refgroup = "B",            # reference for comparison
     estimate = "difference",   # type of estimate
     sig = 0.95,                # confidence level for the interval
     mp = 100000,               # we want rates per 100,000 population
     decimals = 2)              # number of decimals

# Print table
knitr::kable(mortality_rd) 

| Comparator | Reference | Std Rate (per 100000) | Rate Difference (RD) | 95% LCL (RD) | 95% UCL (RD) |
|---|---|---|---|---|---|
| A | B | 23.57 | 4.24 | 3.24 | 5.24 |
| B | B | 19.33 | 0.00 | -1.24 | 1.24 |

Country A has 4.24 additional deaths per 100,000 population (95% CI 3.24-5.24) compared to country B.
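
As a quick check, the point estimates of the rate ratio and rate difference can be reproduced from the standardized rates reported in the tables above. This sketch only reproduces the point estimates; it does not attempt the confidence intervals, which dsrr() calculates internally.

```r
# Standardized rates per 100,000, copied from the output tables above
std_rate_A <- 23.57
std_rate_B <- 19.33

# Rate ratio and rate difference, with country B as the reference
rate_ratio <- std_rate_A / std_rate_B
rate_diff  <- std_rate_A - std_rate_B

round(rate_ratio, 2)
round(rate_diff, 2)
```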

21.4 PHEindicatormethods package

Another way of calculating standardized rates is with the PHEindicatormethods package. This package allows you to calculate directly as well as indirectly standardized rates. We will show both.

This section will use the all_data data frame created at the end of the Preparation section. This data frame includes the country populations, death events, and the world standard reference population. You can view it here.

Directly standardized rates

Below, we first group the data by Country and then pass it to the function phe_dsr() to get directly standardized rates per country.

Of note - the reference (standard) population can be provided as a column within the country-specific data frame or as a separate vector. If provided within the country-specific data frame, you have to set stdpoptype = "field". If provided as a vector, set stdpoptype = "vector". In the latter case, you have to make sure the ordering of rows by strata is identical in the country-specific data frame and the reference population, as records will be matched by position. In our example below, we provide the reference population as a column within the country-specific data frame.

See the help with ?phe_dsr or the links in the Resources section for more information.

# Calculate rates per country directly standardized for age and sex
mortality_ds_rate_phe <- all_data %>%
     group_by(Country) %>%
     PHEindicatormethods::phe_dsr(
          x = Deaths,                 # column with observed number of events
          n = Population,             # column with non-standard pops for each stratum
          stdpop = pop,               # standard populations for each stratum
          stdpoptype = "field")       # either "vector" for a standalone vector or "field" meaning std populations are in the data  

# Print table
knitr::kable(mortality_ds_rate_phe)

| Country | total_count | total_pop | value | lowercl | uppercl | confidence | statistic | method |
|---|---|---|---|---|---|---|---|---|
| A | 11344 | 86790567 | 23.56686 | 23.08107 | 24.05944 | 95% | dsr per 100000 | Dobson |
| B | 9955 | 52898281 | 19.32549 | 18.45516 | 20.20882 | 95% | dsr per 100000 | Dobson |

Indirectly standardized rates

For indirect standardization, you need a reference population with the number of deaths and number of population per stratum. In this example, we will be calculating rates for country A using country B as the reference population, as the standard_pop_clean reference population does not include number of deaths per stratum.

Below, we first create the reference population from country B. Then, we take the mortality and population data for country A, combine it with the reference population, and pass it to the function phe_isr() to get indirectly standardized rates. You can of course also do the reverse, standardizing country B against country A.

Of note - in our example below, the reference population is provided as a separate data frame. In this case, we make sure that x =, n =, x_ref = and n_ref = vectors are all ordered by the same standardization category (stratum) values as that in our country-specific data frame, as records will be matched by position.

See the help with ?phe_isr or the links in the Resources section for more information.

# Create reference population
refpopCountryB <- country_data %>% 
  filter(Country == "B") 

# Calculate rates for country A indirectly standardized by age and sex
mortality_is_rate_phe_A <- country_data %>%
     filter(Country == "A") %>%
     PHEindicatormethods::phe_isr(
          x = Deaths,                 # column with observed number of events
          n = Population,             # column with non-standard pops for each stratum
          x_ref = refpopCountryB$Deaths,  # reference number of deaths for each stratum
          n_ref = refpopCountryB$Population)  # reference population for each stratum

# Print table
knitr::kable(mortality_is_rate_phe_A)

| observed | expected | ref_rate | value | lowercl | uppercl | confidence | statistic | method |
|---|---|---|---|---|---|---|---|---|
| 11344 | 15847.42 | 18.81914 | 13.47123 | 13.22446 | 13.72145 | 95% | isr per 100000 | Byars |
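
The logic of indirect standardization can be sketched by hand in base R. The data frames study and reference below are hypothetical toy data. The last lines check that the reported value in the output above is consistent with observed / expected * ref_rate; this is our reading of the output columns, not a description of the internals of phe_isr().

```r
# Hypothetical toy data: study population and reference population, same strata order
study <- data.frame(
  age_group  = c("young", "old"),
  deaths     = c(10, 90),
  population = c(10000, 5000)
)
reference <- data.frame(
  age_group  = c("young", "old"),
  deaths     = c(40, 400),
  population = c(20000, 20000)
)

# Expected deaths: apply the reference stratum rates to the study population
ref_rates <- reference$deaths / reference$population
expected  <- sum(ref_rates * study$population)
observed  <- sum(study$deaths)

# Ratio of observed to expected deaths
obs_exp_ratio <- observed / expected

# Indirectly standardized rate per 100,000:
# the ratio multiplied by the crude rate of the reference population
ref_crude_rate <- sum(reference$deaths) / sum(reference$population) * 100000
isr <- obs_exp_ratio * ref_crude_rate

# Same calculation with the numbers from the output table above
isr_from_output <- 11344 / 15847.42 * 18.81914

isr
isr_from_output
```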

21.5 Resources

If you would like to see another reproducible example using dsr please see this vignette

For another example using PHEindicatormethods, please go to this website

See the PHEindicatormethods reference pdf file

22 Moving averages

This page will cover two methods to calculate and visualize moving averages:

  1. Calculate with the slider package
  2. Calculate within a ggplot() command with the tidyquant package

22.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,            # for importing data with import()
  tidyverse,      # for data management and viz
  slider,         # for calculating moving averages
  tidyquant       # for calculating moving averages within ggplot
)

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

22.2 Calculate with slider

Use this approach to calculate a moving average in a data frame prior to plotting.

The slider package provides several “sliding window” functions to compute rolling averages, cumulative sums, rolling regressions, etc. It treats a data frame as a vector of rows, allowing iteration row-wise over a data frame.

Here are some of the common functions:

  • slide_dbl() - iterates through a numeric (hence "_dbl") column performing an operation using a sliding window
    • slide_sum() - rolling sum shortcut function for slide_dbl()
    • slide_mean() - rolling average shortcut function for slide_dbl()
  • slide_index_dbl() - applies the rolling window on a numeric column using a separate column to index the window progression (useful if rolling by date with some dates absent)
    • slide_index_sum() - rolling sum shortcut function with indexing
    • slide_index_mean() - rolling mean shortcut function with indexing

The slider package has many other functions that are covered in the Resources section of this page. We briefly touch upon the most common.

Core arguments

  • .x, the first argument by default, is the vector to iterate over and to apply the function to
  • .i = for the “index” versions of the slider functions - provide a column to “index” the roll on (see section below)
  • .f =, the second argument by default, either:
    • A function, written without parentheses, like mean, or
    • A formula, which will be converted into a function. For example, ~ .x - mean(.x) subtracts the mean of the window’s values from each value in the window
  • For more details see this reference material

Window size

Specify the size of the window by using either .before, .after, or both arguments:

  • .before = - Provide an integer
  • .after = - Provide an integer
  • .complete = - Set this to TRUE if you only want calculation performed on complete windows

For example, to achieve a 7-day window including the current value and the six previous, use .before = 6. To achieve a “centered” window provide the same number to both .before = and .after =.

By default, .complete = is FALSE, so if the full window of rows does not exist, the functions use the available rows to perform the calculation. Setting it to TRUE restricts calculations to complete windows only; incomplete windows return NA.
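
The window arguments can be tried out on a small vector. This is a minimal sketch; the vector x is hypothetical and stands in for a column of daily counts.

```r
library(slider)

# Hypothetical daily counts, one value per row
x <- c(1, 2, 3, 4, 5)

# Trailing 3-row window: the current row and the 2 rows before it.
# Incomplete windows at the start are computed from the rows available.
trailing_partial <- slide_dbl(x, mean, .before = 2)

# Same window, but only complete windows are evaluated (others become NA)
trailing_complete <- slide_dbl(x, mean, .before = 2, .complete = TRUE)

# Centered 3-row window: one row before and one row after the current row
centered <- slide_dbl(x, mean, .before = 1, .after = 1)

trailing_partial    # 1.0 1.5 2.0 3.0 4.0
trailing_complete   #  NA  NA 2.0 3.0 4.0
centered            # 1.5 2.0 3.0 4.0 4.5
```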

Expanding window

To achieve cumulative operations, set the .before = argument to Inf. This will conduct the operation on the current value and all coming before.
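
For example, a cumulative case count could be sketched as follows. The vector new_cases is hypothetical; for a plain cumulative sum, base R's cumsum() gives the same result, so the slider version is mainly useful when you want a cumulative version of some other function.

```r
library(slider)

# Hypothetical daily case counts
new_cases <- c(2, 0, 5, 1, 3)

# Expanding window: the current value and all values before it
cumulative_cases <- slide_dbl(new_cases, sum, .before = Inf)

cumulative_cases   # 2 2 7 8 11
```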

Rolling by date

The most likely use-case of a rolling calculation in applied epidemiology is to examine a metric over time. For example, a rolling measurement of case incidence, based on daily case counts.

If you have clean time series data with values for every date, you may be OK to use slide_dbl(), as demonstrated here in the Time series and outbreak detection page.

However, in many applied epidemiology circumstances you may have dates absent from your data, where there are no events recorded. In these cases, it is best to use the “index” versions of the slider functions.

Indexed data

Below, we show an example using slide_index_dbl() on the case linelist. Let us say that our objective is to calculate a rolling 7-day incidence - the sum of cases using a rolling 7-day window. If you are looking for an example of rolling average, see the section below on grouped rolling.

To begin, the dataset daily_counts is created to reflect the daily case counts from the linelist, as calculated with count() from dplyr.

# make dataset of daily counts
daily_counts <- linelist %>% 
  count(date_hospitalisation, name = "new_cases")

Here is the daily_counts data frame. Each day with at least one admission is represented by one row (run nrow(daily_counts) to see how many rows there are), but some days are absent, especially early in the epidemic, because no cases were admitted on those days.

It is crucial to recognize that a standard rolling function (like slide_dbl()) would use a window of 7 rows, not 7 days. So, if there are any absent dates, some windows will actually extend more than 7 calendar days!

A “smart” rolling window can be achieved with slide_index_dbl(). The “index” means that the function uses a separate column as an “index” for the rolling window. The window is not simply based on the rows of the data frame.

If the index column is a date, you have the added ability to specify the window extent to .before = and/or .after = in units of lubridate days() or months(). If you do these things, the function will include absent days in the windows as if they were there (as NA values).

Let’s show a comparison. Below, we calculate rolling 7-day case incidence with regular and indexed windows.

rolling <- daily_counts %>% 
  mutate(                                # create new columns
    # Using slide_dbl()
    ###################
    reg_7day = slide_dbl(
      new_cases,                         # calculate on new_cases
      .f = ~sum(.x, na.rm = TRUE),       # function is sum() with missing values removed
      .before = 6),                      # window is the ROW and 6 prior ROWS
    
    # Using slide_index_dbl()
    #########################
    indexed_7day = slide_index_dbl(
        new_cases,                       # calculate on new_cases
        .i = date_hospitalisation,       # indexed with date_hospitalisation
        .f = ~sum(.x, na.rm = TRUE),     # function is sum() with missing values removed
        .before = days(6))               # window is the DAY and 6 prior DAYS
    )

Observe how in the regular column the count steadily increases over the first 7 rows, despite the rows not being within 7 days of each other! The adjacent “indexed” column accounts for these absent calendar days, so its 7-day sums are much lower, at least in this period of the epidemic when the cases are farther apart.
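
The difference between row-based and date-based windows is easiest to see on a tiny dataset with a gap in the dates. The data frame toy_counts below is hypothetical: it has counts on days 1, 2, 11, and 12 of January, with nothing recorded in between.

```r
library(slider)
library(lubridate)

# Hypothetical daily counts with a 9-day gap between the 2nd and 3rd rows
toy_counts <- data.frame(
  date      = as.Date("2020-01-01") + c(0, 1, 10, 11),
  new_cases = c(1, 2, 3, 4)
)

# Row-based window: the current ROW and the 6 rows before it
row_window <- slide_dbl(toy_counts$new_cases, sum, .before = 6)

# Date-based window: the current DAY and the 6 days before it
day_window <- slide_index_dbl(
  toy_counts$new_cases,
  .i = toy_counts$date,
  .f = sum,
  .before = days(6))

row_window   # 1 3 6 10
day_window   # 1 3 3 7
```

The row-based sums keep accumulating across the gap, while the date-based sums on 11 and 12 January only include cases from the preceding 7 calendar days.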

Now you can plot these data using ggplot():

ggplot(data = rolling)+
  geom_line(mapping = aes(x = date_hospitalisation, y = indexed_7day), size = 1)

Rolling by group

If you group your data prior to using a slider function, the sliding windows will be applied by group. Be careful to arrange your rows in the desired order by group.

Each time a new group begins, the sliding window will re-start. Therefore, one nuance to be aware of is that if your data are grouped and you have set .complete = TRUE, you will have empty values at each transition between groups. As the function moves downward through the rows, every transition in the grouping column re-starts the accrual of the minimum window size needed for a calculation.

See handbook page on Grouping data for details on grouping data.

Below, we count linelist cases by date and by hospital. Then we arrange the rows in ascending order, first ordering by hospital and then within that by date. Next we set group_by(). Then we can create our new rolling average.

grouped_roll <- linelist %>%

  count(hospital, date_hospitalisation, name = "new_cases") %>% 

  arrange(hospital, date_hospitalisation) %>%   # arrange rows by hospital and then by date
  
  group_by(hospital) %>%              # group by hospital 
    
  mutate(                             # rolling average  
    mean_7day_hosp = slide_index_dbl(
      .x = new_cases,                 # the count of cases per hospital-day
      .i = date_hospitalisation,      # index on date of admission
      .f = mean,                      # use mean()                   
      .before = days(6)               # use the day and the 6 days prior
      )
  )

Here is the new dataset:

We can now plot the moving averages, displaying the data by group by specifying ~ hospital to facet_wrap() in ggplot(). For fun, we plot two geometries - a geom_col() showing the daily case counts and a geom_line() showing the 7-day moving average.

ggplot(data = grouped_roll)+
  geom_col(                       # plot daily case counts as grey bars
    mapping = aes(
      x = date_hospitalisation,
      y = new_cases),
    fill = "grey",
    width = 1)+
  geom_line(                      # plot rolling average as line colored by hospital
    mapping = aes(
      x = date_hospitalisation,
      y = mean_7day_hosp,
      color = hospital),
    size = 1)+
  facet_wrap(~hospital, ncol = 2)+ # create mini-plots per hospital
  theme_classic()+                 # simplify background  
  theme(legend.position = "none")+ # remove legend
  labs(                            # add plot labels
    title = "7-day rolling average of daily case incidence",
    x = "Date of admission",
    y = "Case incidence")

DANGER: If you get an error saying “slide() was deprecated in tsibble 0.9.0 and is now defunct. Please use slider::slide() instead.”, it means that the slide() function from the tsibble package is masking the slide() function from slider package. Fix this by specifying the package in the command, such as slider::slide_dbl().

22.3 Calculate with tidyquant within ggplot()

The package tidyquant offers another approach to calculating moving averages - this time from within a ggplot() command itself.

Below the linelist data are counted by date of onset, and this is plotted as a faded line (alpha < 1). Overlaid on top is a line created with geom_ma() from the package tidyquant, with a set window of 7 days (n = 7) with specified color and thickness.

By default geom_ma() uses a simple moving average (ma_fun = "SMA"), but other types can be specified, such as:

  • “EMA” - exponential moving average (more weight to recent observations)
  • “WMA” - weighted moving average (wts are used to weight observations in the moving average)
  • Others can be found in the function documentation

linelist %>% 
  count(date_onset) %>%                 # count cases per day
  drop_na(date_onset) %>%               # remove cases missing onset date
  ggplot(aes(x = date_onset, y = n))+   # start ggplot
    geom_line(                          # plot raw values
      size = 1,
      alpha = 0.2                       # semi-transparent line
      )+             
    tidyquant::geom_ma(                 # plot moving average
      n = 7,           
      size = 1,
      color = "blue")+ 
  theme_minimal()                       # simple background

See this vignette for more details on the options available within tidyquant.
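
To see what a simple moving average computes, here is a base R sketch using stats::filter(), a swapped-in technique that is not part of tidyquant. It assumes the simple moving average is the usual trailing mean of the current value and the n - 1 values before it; the vector n_cases and the window of 3 are hypothetical and chosen to keep the numbers small.

```r
# Hypothetical daily counts
n_cases <- c(1, 2, 3, 4, 5, 6)

# Trailing simple moving average with a window of 3 values.
# sides = 1 uses the current and previous values only.
# stats:: is written explicitly because dplyr also has a filter() function.
window_size <- 3
sma <- as.numeric(
  stats::filter(n_cases, rep(1 / window_size, window_size), sides = 1)
)

sma   # NA NA 2 3 4 5
```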

22.4 Resources

See the helpful online vignette for the slider package

The slider github page

A slider vignette

tidyquant vignette

If your use case requires that you “skip over” weekends and even holidays, you might like the almanac package.

23 Time series and outbreak detection

23.1 Overview

This page demonstrates the use of several packages for time series analysis. It primarily relies on packages from the tidyverts family, but will also use the RECON trending package to fit models that are more appropriate for infectious disease epidemiology.

Note that in the example below we use a dataset from the surveillance package on Campylobacter in Germany (see the data chapter of the handbook for details). However, if you want to run the same code on a dataset with multiple countries or other strata, there is an example code template for this in the r4epis github repo.

Topics covered include:

  1. Time series data
  2. Descriptive analysis
  3. Fitting regressions
  4. Relation of two time series
  5. Outbreak detection
  6. Interrupted time series

23.2 Preparation

Packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics
               tsibble,      # handle time series datasets
               slider,       # for calculating moving averages
               imputeTS,     # for filling in missing values
               feasts,       # for time series decomposition and autocorrelation
               forecast,     # fit sine and cosine terms to data (note: must load after feasts)
               trending,     # fit and assess models 
               tmaptools,    # for getting geocoordinates (lon/lat) based on place names
               ecmwfr,       # for interacting with copernicus satellite CDS API
               stars,        # for reading in .nc (climate data) files
               units,        # for defining units of measurement (climate data)
               yardstick,    # for looking at model accuracy
               surveillance  # for aberration detection
               )

Load data

You can download all the data used in this handbook via the instructions in the Download handbook and data page.

The example dataset used in this section is weekly counts of campylobacter cases reported in Germany between 2001 and 2011. You can click here to download this data file (.xlsx).

This dataset is a reduced version of the dataset available in the surveillance package (for details, load the surveillance package and see ?campyDE).

Import these data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import the counts into R
counts <- rio::import("campylobacter_germany.xlsx")

The first 10 rows of the counts are displayed below.

Clean data

The code below makes sure that the date column is in the appropriate format. For this page we will be using the tsibble package, so the yearweek() function will be used to create a calendar week variable. There are several other ways of doing this (see the Working with dates page for details), however for time series it’s best to keep within one framework (tsibble).

## ensure the date column is in the appropriate format
counts$date <- as.Date(counts$date)

## create a calendar week variable 
## fitting ISO definitions of weeks starting on a Monday
counts <- counts %>% 
     mutate(epiweek = yearweek(date, week_start = 1))
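
As a small illustration of what yearweek() returns, the sketch below converts three hypothetical dates. With weeks starting on Monday, 1 January 2011 (a Saturday) should fall in the last week of 2010, and converting a yearweek back with as.Date() should give the Monday that starts that week.

```r
library(tsibble)

# Hypothetical dates around the 2010/2011 year boundary
example_dates <- as.Date(c("2011-01-01", "2011-01-03", "2011-01-09"))

# Convert to calendar weeks starting on Monday
example_weeks <- yearweek(example_dates, week_start = 1)

# Week labels, and the date on which each week starts
week_labels <- format(example_weeks)
week_starts <- as.Date(example_weeks)

week_labels
week_starts
```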

Download climate data

In the relation of two time series section of this page, we will be comparing campylobacter case counts to climate data.

Climate data for anywhere in the world can be downloaded from the EU’s Copernicus Satellite. These are not exact measurements but are based on a model (similar to interpolation); the benefit is global hourly coverage as well as forecasts.

You can download each of these climate data files from the Download handbook and data page.

For purposes of demonstration here, we will show R code to use the ecmwfr package to pull these data from the Copernicus climate data store. You will need to create a free account in order for this to work. The package website has a useful walkthrough of how to do this. Below is example code of how to go about doing this, once you have the appropriate API keys. You have to replace the X’s below with your account IDs. You will need to download one year of data at a time, otherwise the server times out.

If you are not sure of the coordinates for a location you want to download data for, you can use the tmaptools package to pull the coordinates from OpenStreetMap. An alternative option is the photon package, however this has not been released on CRAN yet; the nice thing about photon is that it provides more contextual data when there are several matches for your search.

## retrieve location coordinates
coords <- geocode_OSM("Germany", geometry = "point")

## pull together long/lats in format for ERA-5 querying (bounding box) 
## (as just want a single point can repeat coords)
request_coords <- str_glue_data(coords$coords, "{y}/{x}/{y}/{x}")


## Pulling data modelled from copernicus satellite (ERA-5 reanalysis)
## https://cds.climate.copernicus.eu/cdsapp#!/software/app-era5-explorer?tab=app
## https://github.com/bluegreen-labs/ecmwfr

## set up key for weather data 
wf_set_key(user = "XXXXX",
           key = "XXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXX",
           service = "cds") 

## run for each year of interest (otherwise server times out)
for (i in 2002:2011) {
  
  ## pull together a query 
  ## see here for how to do: https://bluegreen-labs.github.io/ecmwfr/articles/cds_vignette.html#the-request-syntax
  ## change request to a list using addin button above (python to list)
  ## Target is the name of the output file!!
  request <- list(
    product_type = "reanalysis",
    format = "netcdf",
    variable = c("2m_temperature", "total_precipitation"),
    year = c(i),
    month = c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"),
    day = c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
            "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
            "25", "26", "27", "28", "29", "30", "31"),
    time = c("00:00", "01:00", "02:00", "03:00", "04:00", "05:00", "06:00", "07:00",
             "08:00", "09:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
             "16:00", "17:00", "18:00", "19:00", "20:00", "21:00", "22:00", "23:00"),
    area = request_coords,
    dataset_short_name = "reanalysis-era5-single-levels",
    target = paste0("germany_weather", i, ".nc")
  )
  
  ## download the file and store it in the folder given by path
  file <- wf_request(user     = "XXXXX",  # user ID (for authentication)
                     request  = request,  # the request
                     transfer = TRUE,     # download the file
                     path     = here::here("data", "Weather")) ## path to save the data
  }

Load climate data

Whether you downloaded the climate data via our handbook, or used the code above, you now should have 10 years of “.nc” climate data files stored in the same folder on your computer.

Use the code below to import these files into R with the stars package.

## define path to weather folder 
file_paths <- list.files(
  here::here("data", "time_series", "weather"), # replace with your own file path 
  full.names = TRUE)

## only keep those with the current name of interest 
file_paths <- file_paths[str_detect(file_paths, "germany")]

## read in all the files as a stars object 
data <- stars::read_stars(file_paths)
## t2m, tp, 
## t2m, tp, 
## t2m, tp, 
## t2m, tp, 
## t2m, tp, 
## t2m, tp, 
## t2m, tp, 
## t2m, tp, 
## t2m, tp, 
## t2m, tp,

Once these files have been imported as the object data, we will convert them to a data frame.

## change to a data frame 
temp_data <- as_tibble(data) %>% 
  ## add in variables and correct units
  mutate(
    ## create a calendar week variable 
    epiweek = tsibble::yearweek(time), 
    ## create a date variable (start of calendar week)
    date = as.Date(epiweek),
    ## change temperature from kelvin to celsius
    t2m = set_units(t2m, celsius), 
    ## change precipitation from metres to millimetres 
    tp  = set_units(tp, mm)) %>% 
  ## group by week (keep the date too though)
  group_by(epiweek, date) %>% 
  ## get the average per week
  summarise(t2m = as.numeric(mean(t2m)), 
            tp = as.numeric(mean(tp)))
## `summarise()` has grouped output by 'epiweek'. You can override using the `.groups` argument.

23.3 Time series data

There are a number of different packages for structuring and handling time series data. As mentioned above, we will focus on the tidyverts family of packages and so will use the tsibble package to define our time series object. Having a data set defined as a time series object means it is much easier to structure our analysis.

To do this we use the tsibble() function and specify the “index”, i.e. the variable specifying the time unit of interest. In our case this is the epiweek variable.

If we had a data set with weekly counts by province, for example, we would also be able to specify the grouping variable using the key = argument. This would allow us to do analysis for each group.

## define time series object 
counts <- tsibble(counts, index = epiweek)

Looking at class(counts) tells you that on top of being a tidy data frame (“tbl_df”, “tbl”, “data.frame”), it has the additional properties of a time series data frame (“tbl_ts”).
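
For a dataset with several groups, the key = argument mentioned above might be used as in this sketch. The columns province and case and all values are hypothetical; key_vars() and n_keys() are tsibble helpers for inspecting the key.

```r
library(tsibble)

# Hypothetical weekly counts for two provinces over three weeks
toy_ts <- tsibble(
  province = rep(c("North", "South"), each = 3),
  epiweek  = yearweek(rep(as.Date("2011-01-03") + c(0, 7, 14), times = 2)),
  case     = c(5, 7, 6, 12, 9, 11),
  index    = epiweek,    # the time variable
  key      = province    # the grouping variable
)

class(toy_ts)      # includes "tbl_ts"
key_vars(toy_ts)   # "province"
n_keys(toy_ts)     # 2
```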

You can take a quick look at your data by using ggplot2. We see from the plot that there is a clear seasonal pattern, and that there are no missings. However, there seems to be an issue with reporting at the beginning of each year; cases drop in the last week of the year and then increase for the first week of the next year.

## plot a line graph of cases by week
ggplot(counts, aes(x = epiweek, y = case)) + 
     geom_line()

DANGER: Most datasets aren’t as clean as this example. You will need to check for duplicates and missings as below.

Duplicates

tsibble does not allow duplicate observations. So each row will need to be unique, or unique within the group (key variable). The package has a few functions that help to identify duplicates. These include are_duplicated() which gives you a TRUE/FALSE vector of whether the row is a duplicate, and duplicates() which gives you a data frame of the duplicated rows.

See the page on De-duplication for more details on how to select rows you want.

## get a vector of TRUE/FALSE whether rows are duplicates
are_duplicated(counts, index = epiweek) 

## get a data frame of any duplicated rows 
duplicates(counts, index = epiweek) 
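To see what such a check amounts to, here is a minimal base R sketch on a made-up weekly dataset. The `toy` data frame and its values are invented for illustration, and the sketch does not use the tsibble functions, so their behaviour may differ in detail.

```r
## toy weekly counts with one week entered twice (hypothetical data)
toy <- data.frame(
  epiweek = c("2002 W01", "2002 W02", "2002 W02", "2002 W03"),
  case    = c(10, 12, 12, 9)
)

## TRUE for every row whose epiweek appears more than once
dup_flag <- duplicated(toy$epiweek) | duplicated(toy$epiweek, fromLast = TRUE)

## show the rows involved so you can decide which to keep
dup_rows <- toy[dup_flag, ]

## one possible choice: keep only the first row for each epiweek
toy_unique <- toy[!duplicated(toy$epiweek), ]
```

Which row to keep is a judgement call that depends on how the duplicates arose.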

Missings

We saw from our brief inspection above that there are no missings, but we also saw there seems to be a problem with reporting delay around new year. One way to address this problem could be to set these values to missing and then to impute values. The simplest form of time series imputation is to draw a straight line between the last non-missing and the next non-missing value. To do this we will use the imputeTS package function na_interpolation().

See the Missing data page for other options for imputation.

Another alternative would be to calculate a moving average, to try and smooth over these apparent reporting issues (see next section, and the page on Moving averages).

## create a variable with missings instead of weeks with reporting issues
counts <- counts %>% 
     mutate(case_miss = if_else(
          ## if epiweek contains 51, 52, 53, 1 or 2
          str_detect(epiweek, "W51|W52|W53|W01|W02"), 
          ## then set to missing 
          NA_real_, 
          ## otherwise keep the value in case
          case
     ))

## alternatively interpolate missings by linear trend 
## between two nearest adjacent points
counts <- counts %>% 
  mutate(case_int = imputeTS::na_interpolation(case_miss)
         )

## to check what values have been imputed compared to the original
ggplot_na_imputations(counts$case_miss, counts$case_int) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()
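For intuition about what linear interpolation does, the following base R sketch fills two missing weeks in a made-up vector using approx(). The values are invented, and this is an illustration of the idea rather than the internals of imputeTS.

```r
## toy weekly counts with two missing weeks (hypothetical data)
case_miss <- c(10, 12, NA, NA, 18, 20)
week      <- seq_along(case_miss)

## keep only the observed points to define the straight lines
observed_weeks <- !is.na(case_miss)

## evaluate the line between neighbouring observed points at every week
case_int <- approx(x    = week[observed_weeks],
                   y    = case_miss[observed_weeks],
                   xout = week)$y

case_int
## 10 12 14 16 18 20
```

The two missing weeks lie on the straight line between 12 and 18, so they are filled with 14 and 16.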

23.4 Descriptive analysis

Moving averages

If data is very noisy (counts jumping up and down) then it can be helpful to calculate a moving average. In the example below, for each week we calculate the average number of cases over that week and the four previous weeks. This smooths the data, making it easier to interpret. In our case this does not really add much, so we will stick to the interpolated data for further analysis. See the Moving averages page for more detail.

## create a moving average variable (deals with missings)
counts <- counts %>% 
     ## create the ma_4wk variable 
     ## slide over each row of the case variable
     mutate(ma_4wk = slider::slide_dbl(case, 
                               ## for each row calculate the mean
                               ~ mean(.x, na.rm = TRUE),
                               ## use the current week and the four previous weeks
                               .before = 4))

## make a quick visualisation of the difference 
ggplot(counts, aes(x = epiweek)) + 
     geom_line(aes(y = case)) + 
     geom_line(aes(y = ma_4wk), colour = "red")
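The window used above can be reproduced by hand. This base R sketch (with invented values) computes a trailing mean over the current week and up to four previous weeks. It assumes that windows at the start of the series are simply shorter, which is how slider behaves when .complete is left at its default.

```r
## toy weekly counts (hypothetical data)
case <- c(2, 4, 6, 8, 10, 12, 14)

## trailing mean over the current week and up to four previous weeks
## (windows at the start are shorter because earlier weeks do not exist)
ma_manual <- sapply(seq_along(case), function(i) {
  window <- case[max(1, i - 4):i]
  mean(window, na.rm = TRUE)
})

ma_manual
## 2 3 4 5 6 8 10
```

From the fifth week onwards each value is the mean of five weeks, for example (2 + 4 + 6 + 8 + 10) / 5 = 6.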

Periodicity

Below we define a custom function to create a periodogram. See the Writing functions page for information about how to write functions in R.

First, the function is defined. Its arguments are a dataset (x), the column in that dataset holding the counts (counts), start_week = (the first week of the dataset), period = (how many periods there are per year, e.g. 52 or 12), and output = (the output style; see details in the code below).

## Function arguments
#####################
## x is a dataset
## counts is variable with count data or rates within x 
## start_week is the first week in your dataset
## period is how many units in a year 
## output is whether you want to return the spectral periodogram or the peak weeks
  ## "periodogram" or "weeks"

# Define function
periodogram <- function(x, 
                        counts, 
                        start_week = c(2002, 1), 
                        period = 52, 
                        output = "weeks") {
  

    ## make sure x is not a tsibble
    prepare_data <- dplyr::as_tibble(x)
    
    ## only keep the column of interest
    prepare_data <- dplyr::select(prepare_data, {{counts}})
    
    ## create an intermediate "zoo" time series to be able to use with spec.pgram
    zoo_cases <- zoo::zooreg(prepare_data, 
                             start = start_week, frequency = period)
    
    ## get a spectral periodogram not using fast fourier transform 
    periodo <- spec.pgram(zoo_cases, fast = FALSE, plot = FALSE)
    
    ## convert frequencies to periods in weeks, ordered from highest to lowest spectral density
    periodo_weeks <- 1 / periodo$freq[order(-periodo$spec)] * period
    
    if (output == "weeks") {
      periodo_weeks
    } else {
      periodo
    }
    
}

## get the spectral periodogram, used to find the periods with the highest spectral density
## (checking for seasonality) 
periodo <- periodogram(counts, 
                       case_int, 
                       start_week = c(2002, 1),
                       output = "periodogram")

## pull spectrum and frequency in to a dataframe for plotting
periodo <- data.frame(periodo$freq, periodo$spec)

## plot a periodogram showing the most frequently occurring periodicity 
ggplot(data = periodo, 
                aes(x = 1/(periodo.freq/52),  y = log(periodo.spec))) + 
  geom_line() + 
  labs(x = "Period (Weeks)", y = "Log(density)")

## get a vector of periods (in weeks), ordered from highest to lowest spectral density 
peak_weeks <- periodogram(counts, 
                          case_int, 
                          start_week = c(2002, 1), 
                          output = "weeks")
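As a check on the approach, the sketch below applies spec.pgram() to simulated weekly counts with a known yearly cycle. The data, seed and object names are invented for illustration, and the example uses a plain ts object rather than the zoo object used in the function above.

```r
## simulate ten years of weekly counts with a yearly cycle (hypothetical data)
set.seed(123)
week  <- 1:520
cases <- 100 + 30 * sin(2 * pi * week / 52) + rnorm(length(week), sd = 5)

## declare 52 observations per year so that frequencies are in cycles per year
cases_ts <- ts(cases, start = c(2002, 1), frequency = 52)

## spectral periodogram, without plotting and without the fast fourier transform
pgram <- spec.pgram(cases_ts, fast = FALSE, plot = FALSE)

## convert the frequency with the highest spectral density to a period in weeks
peak_period <- 52 / pgram$freq[which.max(pgram$spec)]
round(peak_period, 1)
```

Because the simulated cycle repeats once per year, the strongest period should come out at about 52 weeks.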

NOTE: It is possible to use the above weeks to construct sine and cosine terms; however, we will use a function to generate these terms (see the regression section below).

Decomposition

Classical decomposition is used to break a time series down into several parts which, taken together, make up the pattern you see. These different parts are:

  • The trend-cycle (the long-term direction of the data)
  • The seasonality (repeating patterns)
  • The random (what is left after removing trend and season)

## decompose the counts dataset 
counts %>% 
  # using an additive classical decomposition model
  model(classical_decomposition(case_int, type = "additive")) %>% 
  ## extract the important information from the model
  components() %>% 
  ## generate a plot 
  autoplot()
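To make the "taken together" point concrete, this sketch uses base R's decompose() on simulated data (invented values) and checks that trend, seasonal and random components add back up to the original series. decompose() is used here only as a dependency-free stand-in; its results are not guaranteed to match classical_decomposition() exactly.

```r
## simulate five years of weekly counts with trend and seasonality (hypothetical data)
set.seed(1)
week  <- 1:260
cases <- 50 + 0.1 * week + 10 * sin(2 * pi * week / 52) + rnorm(length(week), sd = 2)
cases_ts <- ts(cases, frequency = 52)

## additive classical decomposition
parts <- decompose(cases_ts, type = "additive")

## add the three components back together
rebuilt <- parts$trend + parts$seasonal + parts$random

## the trend (and so the random part) is NA at both ends of the series
ok <- !is.na(rebuilt)

## TRUE if the components sum to the observed values
all.equal(as.numeric(rebuilt[ok]), cases[ok])
```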

Autocorrelation

Autocorrelation tells you about the relation between the counts of each week and the weeks before it (called lags).

Using the ACF() function, we can produce a plot which shows a number of lines for the relation at different lags. Where the lag is 0 (x = 0), this line would always be 1 as it shows the relation between an observation and itself (not shown here). The first line shown here (x = 1) shows the relation between each observation and the observation before it (lag of 1), the second shows the relation between each observation and the observation before last (lag of 2), and so on until a lag of 52, which shows the relation between each observation and the observation from 1 year (52 weeks) before.

Using the PACF() function (for partial autocorrelation) shows the same type of relation but adjusted for all other weeks between. This is less informative for determining periodicity.

## using the counts dataset
counts %>% 
  ## calculate autocorrelation using a full year's worth of lags
  ACF(case_int, lag_max = 52) %>% 
  ## show a plot
  autoplot()

## using the counts data set 
counts %>% 
  ## calculate the partial autocorrelation using a full year's worth of lags
  PACF(case_int, lag_max = 52) %>% 
  ## show a plot
  autoplot()
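The same idea can be explored numerically with base R's acf(). This sketch uses simulated seasonal data (invented values) and shows that the autocorrelation is 1 at lag 0, strongly positive one year apart, and negative half a year apart.

```r
## simulate five years of weekly counts with a yearly cycle (hypothetical data)
set.seed(1)
week  <- 1:260
cases <- 100 + 30 * sin(2 * pi * week / 52) + rnorm(length(week), sd = 5)

## autocorrelation for lags 0 to 52, without plotting
ac <- acf(cases, lag.max = 52, plot = FALSE)

## store the results as a data frame with one row per lag
acf_by_lag <- data.frame(lag = as.vector(ac$lag), acf = as.vector(ac$acf))

acf_by_lag$acf[acf_by_lag$lag == 0]   ## an observation compared with itself
acf_by_lag$acf[acf_by_lag$lag == 52]  ## observations one year apart
acf_by_lag$acf[acf_by_lag$lag == 26]  ## observations half a year apart
```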

You can formally test the null hypothesis of independence in a time series (i.e. that it is not autocorrelated) using the Ljung-Box test (in the stats package). A significant p-value suggests that there is autocorrelation in the data.

## test for independence 
Box.test(counts$case_int, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  counts$case_int
## X-squared = 462.65, df = 1, p-value < 2.2e-16
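The test statistic is \(Q = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)\), where \(n\) is the number of observations, \(h\) the number of lags tested (1 by default in Box.test()) and \(r_k\) the autocorrelation at lag \(k\). The sketch below computes it by hand on simulated autocorrelated data (invented for illustration) and compares it with Box.test().

```r
## simulate an autocorrelated series (hypothetical data)
set.seed(1)
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = 200))

n <- length(x)  ## number of observations
h <- 1          ## number of lags tested

## autocorrelations at lags 1 to h (drop lag 0)
r <- acf(x, lag.max = h, plot = FALSE)$acf[-1]

## Ljung-Box statistic and its p-value from a chi-squared distribution
Q       <- n * (n + 2) * sum(r^2 / (n - seq_len(h)))
p_value <- pchisq(Q, df = h, lower.tail = FALSE)

## compare with the built-in test
bt <- Box.test(x, lag = h, type = "Ljung-Box")
all.equal(Q, unname(bt$statistic))
```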

23.5 Fitting regressions

It is possible to fit a large number of different regressions to a time series. Here, however, we will demonstrate how to fit a negative binomial regression, as this is often the most appropriate for count data in infectious diseases.

Fourier terms

Fourier terms are the equivalent of sine and cosine curves. The difference is that these are fit based on finding the most appropriate combination of curves to explain your data.

If only fitting one fourier term, this would be the equivalent of fitting a sine and a cosine for the strongest period seen in your periodogram (in our case 52 weeks). We use the fourier() function from the forecast package.

In the code below we assign using the $, as fourier() returns two columns (one for sine, one for cosine). These are added to the dataset as a single two-column variable called “fourier”, which can then be used as a normal variable in regression.

## add in fourier terms using the epiweek and case_int variables
counts$fourier <- select(counts, epiweek, case_int) %>% 
  fourier(K = 1)
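For intuition, one pair of fourier terms with a 52-week period can be written out by hand as below. This is a sketch of the underlying curves using an invented week index; the column names and exact output of fourier() are the package's own and are not reproduced here.

```r
## week index for two years of weekly data (hypothetical)
week   <- 1:104
period <- 52

## one pair of fourier terms (K = 1): a sine and a cosine with a yearly period
fourier_manual <- cbind(
  S1 = sin(2 * pi * 1 * week / period),
  C1 = cos(2 * pi * 1 * week / period)
)

## the sine peaks a quarter of the way through the year (week 13),
## the cosine peaks at the end of each full cycle (week 52)
fourier_manual[c(13, 52), ]
```

Adding a second pair (K = 2) would use 2 in place of 1 inside the sine and cosine, giving curves that repeat twice a year.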

Negative binomial

It is possible to fit regressions using base stats or MASS functions (e.g. lm(), glm() and glm.nb()). However, we will use those from the trending package, as this allows for calculating appropriate confidence and prediction intervals (which are otherwise not available). The syntax is the same: you specify an outcome variable, then a tilde (~), and then add your exposure variables of interest separated by a plus (+).

The other difference is that we first define the model and then fit() it to the data. This is useful because it allows for comparing multiple different models with the same syntax.

TIP: If you wanted to use rates rather than counts, you could include the population variable as a logarithmic offset term by adding offset(log(population)). You would then need to set population to be 1 before using predict() in order to produce a rate.
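The offset idea can be sketched with base R's glm(). A Poisson family is used here only to keep the example free of extra packages (an assumption for illustration; the same offset() syntax applies in a negative binomial formula), and the data are simulated.

```r
## simulate weekly counts from populations of different sizes (hypothetical data)
set.seed(1)
population <- rep(c(1e5, 2e5, 5e5), each = 50)
true_rate  <- 2e-4
cases      <- rpois(length(population), lambda = true_rate * population)
toy        <- data.frame(cases, population)

## intercept-only model with the population as a logarithmic offset
fit <- glm(cases ~ 1 + offset(log(population)), family = poisson, data = toy)

## setting population to 1 makes the prediction a rate per person
rate_hat <- predict(fit, newdata = data.frame(population = 1), type = "response")

## express as a rate per 100,000 population
rate_hat * 1e5
```

In this intercept-only case the estimated rate is simply total cases divided by total population.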

TIP: For fitting more complex models such as ARIMA or prophet, see the fable package.

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier)

## fit your model using the counts dataset
fitted_model <- trending::fit(model, counts)

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model, simulate_pi = FALSE)

## plot your regression 
ggplot(data = observed, aes(x = epiweek)) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate),
            col = "Red") + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add in a line for your observed case counts
  geom_line(aes(y = case_int), 
            col = "black") + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()

Residuals

To see how well our model fits the observed data we need to look at the residuals. The residuals are the difference between the observed counts and the counts estimated from the model. We could calculate this simply by using case_int - estimate, but the residuals() function extracts this directly from the regression for us.

What we see from the below, is that we are not explaining all of the variation that we could with the model. It might be that we should fit more fourier terms, and address the amplitude. However for this example we will leave it as is. The plots show that our model does worse in the peaks and troughs (when counts are at their highest and lowest) and that it might be more likely to underestimate the observed counts.

## calculate the residuals 
observed <- observed %>% 
  mutate(resid = residuals(fitted_model$fitted_model, type = "response"))

## are the residuals fairly constant over time (if not: outbreaks? change in practice?)
observed %>%
  ggplot(aes(x = epiweek, y = resid)) +
  geom_line() +
  geom_point() + 
  labs(x = "epiweek", y = "Residuals")

## is there autocorrelation in the residuals (is there a pattern to the error?)  
observed %>% 
  as_tsibble(index = epiweek) %>% 
  ACF(resid, lag_max = 52) %>% 
  autoplot()

## are residuals normally distributed (are we under- or over-estimating?)  
observed %>%
  ggplot(aes(x = resid)) +
  geom_histogram(binwidth = 100) +
  geom_rug() +
  labs(y = "count") 

## compare fitted values to their residuals 
  ## should also be no pattern 
observed %>%
  ggplot(aes(x = estimate, y = resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Residuals")

## formally test autocorrelation of the residuals
## H0 is that residuals are from a white-noise series (i.e. random)
## test for independence 
## if p value significant then non-random
Box.test(observed$resid, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  observed$resid
## X-squared = 346.64, df = 1, p-value < 2.2e-16

23.6 Relation of two time series

Here we look at using weather data (specifically the temperature) to explain campylobacter case counts.

Merging datasets

We can join our datasets using the epiweek variable. For more on merging see the handbook section on joining.

## left join so that we only have the rows already existing in counts
## drop the date variable from temp_data (otherwise is duplicated)
counts <- left_join(counts, 
                    select(temp_data, -date),
                    by = "epiweek")

Descriptive analysis

First plot your data to see if there is any obvious relation. The plot below shows that there is a clear relation in the seasonality of the two variables, and that temperature might peak a few weeks before the case number. For more on pivoting data, see the handbook section on pivoting data.

counts %>% 
  ## keep the variables we are interested in 
  select(epiweek, case_int, t2m) %>% 
  ## change your data in to long format
  pivot_longer(
    ## use epiweek as your key
    !epiweek,
    ## move column names to the new "measure" column
    names_to = "measure", 
    ## move cell values to the new "values" column
    values_to = "value") %>% 
  ## create a plot with the dataset above
  ## plot epiweek on the x axis and values (counts/celsius) on the y 
  ggplot(aes(x = epiweek, y = value)) + 
    ## create a separate plot for temperature and case counts 
    ## let them set their own y-axes
    facet_grid(measure ~ ., scales = "free_y") +
    ## plot both as a line
    geom_line()

Lags and cross-correlation

To formally test which weeks are most highly related between cases and temperature, we can use the cross-correlation function (CCF()) from the feasts package. Rather than sorting the results with arrange(), you could also visualise them using the autoplot() function.

counts %>% 
  ## calculate cross-correlation between interpolated counts and temperature
  CCF(case_int, t2m,
      ## set the maximum lag to be 52 weeks
      lag_max = 52, 
      ## return the correlation coefficient 
      type = "correlation") %>% 
  ## arrange in descending order of the correlation coefficient 
  ## show the most associated lags
  arrange(-ccf) %>% 
  ## only show the top ten 
  slice_head(n = 10)
## # A tsibble: 10 x 2 [1W]
##      lag   ccf
##    <lag> <dbl>
##  1    4W 0.749
##  2    5W 0.745
##  3    3W 0.735
##  4    6W 0.729
##  5    2W 0.727
##  6    7W 0.704
##  7    1W 0.695
##  8    8W 0.671
##  9    0W 0.649
## 10  -47W 0.638

We see from this that a lag of 4 weeks is most highly correlated, so we make a lagged temperature variable to include in our regression.
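To illustrate how a cross-correlation picks out a lag, here is a minimal sketch using base R's ccf() on simulated data in which cases follow temperature by exactly four weeks. The data and object names are invented, and the sign convention of base ccf() (lag k compares x[t+k] with y[t]) may not match that of feasts.

```r
## simulate a temperature series and cases that follow it four weeks later
## (hypothetical data)
set.seed(42)
n         <- 300
temp_full <- rnorm(n + 4)
temp      <- temp_full[5:(n + 4)]                 ## temperature in week t
cases     <- temp_full[1:n] + rnorm(n, sd = 0.3)  ## tracks temperature in week t - 4

## cross-correlation for lags -10 to 10, without plotting
cc <- ccf(cases, temp, lag.max = 10, plot = FALSE)

## the lag with the highest correlation
best_lag <- cc$lag[which.max(cc$acf)]
best_lag
```

The highest correlation should be at a lag of 4, matching the delay built into the simulation.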

DANGER: Note that the first four weeks of our data in the lagged temperature variable are missing (NA), as there are not four weeks prior to get data from. In order to use this dataset with the trending predict() function, we need to use the simulate_pi = FALSE argument within predict() further down. If we did want to use the simulate option, then we would have to drop these missings and store the result as a new dataset by adding drop_na(t2m_lag4) to the code chunk below.

counts <- counts %>% 
  ## create a new variable for temperature lagged by four weeks
  mutate(t2m_lag4 = lag(t2m, n = 4))
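What the lagged variable looks like can be seen in this base R sketch with invented temperatures. It builds the same shifted vector by hand, on the assumption that lag() in the chunk above is dplyr's version (the stats package also has a lag() function that behaves differently).

```r
## toy weekly temperatures (hypothetical data)
t2m <- c(5, 6, 8, 11, 15, 18, 20)

## shift values down by four weeks: pad the start with NA, drop the last four
t2m_lag4 <- c(rep(NA, 4), head(t2m, -4))

data.frame(week = seq_along(t2m), t2m, t2m_lag4)
```

Week 5 now carries the temperature from week 1, and the first four weeks have no lagged value.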

Negative binomial with two variables

We fit a negative binomial regression as done previously. This time we add the temperature variable lagged by four weeks.

CAUTION: Note the use of simulate_pi = FALSE within predict(). This is because the default behaviour of trending is to use the ciTools package to estimate a prediction interval. This does not work if there are NA counts, and also produces more granular intervals. See ?trending::predict.trending_model_fit for details.

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier + 
    ## use the temperature lagged by four weeks 
    t2m_lag4
    )

## fit your model using the counts dataset
fitted_model <- trending::fit(model, counts)

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model, simulate_pi = FALSE)

To investigate the individual terms, we can pull the original negative binomial regression out of the trending format using get_model() and pass this to the broom package tidy() function to retrieve estimates and associated confidence intervals.

What this shows us is that lagged temperature, after controlling for trend and seasonality, is significantly associated with case counts (p = 0.013), although the effect per degree Celsius is small (rate ratio close to 1). Note that the estimates printed below appear to be on the log scale (the fourier estimates are negative, which a rate ratio cannot be); exponentiating the t2m_lag4 estimate gives exp(0.00667) ≈ 1.007. This suggests that it might be a good variable for use in predicting future case numbers (as climate forecasts are readily available).

fitted_model %>% 
  ## extract original negative binomial regression
  get_model() %>% 
  ## get a tidy dataframe of results
  tidy(exponentiate = TRUE, 
       conf.int = TRUE)
## # A tibble: 5 x 7
##   term           estimate  std.error statistic  p.value   conf.low  conf.high
##   <chr>             <dbl>      <dbl>     <dbl>    <dbl>      <dbl>      <dbl>
## 1 (Intercept)   5.83      0.108          53.8  0         5.61       6.04     
## 2 epiweek       0.0000846 0.00000774     10.9  8.13e-28  0.0000695  0.0000998
## 3 fourierS1-52 -0.285     0.0214        -13.3  1.84e-40 -0.327     -0.243    
## 4 fourierC1-52 -0.195     0.0200         -9.78 1.35e-22 -0.234     -0.157    
## 5 t2m_lag4      0.00667   0.00269         2.48 1.30e- 2  0.00139    0.0119

A quick visual inspection of the model shows that it might do a better job of estimating the observed case counts.

## plot your regression 
ggplot(data = observed, aes(x = epiweek)) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate),
            col = "Red") + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add in a line for your observed case counts
  geom_line(aes(y = case_int), 
            col = "black") + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()

Residuals

We investigate the residuals again to see how well our model fits the observed data. The results and interpretation here are similar to those of the previous regression, so it may be more sensible to stick with the simpler model without temperature.

## calculate the residuals 
observed <- observed %>% 
  mutate(resid = case_int - estimate)

## are the residuals fairly constant over time (if not: outbreaks? change in practice?)
observed %>%
  ggplot(aes(x = epiweek, y = resid)) +
  geom_line() +
  geom_point() + 
  labs(x = "epiweek", y = "Residuals")
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).

## is there autocorrelation in the residuals (is there a pattern to the error?)  
observed %>% 
  as_tsibble(index = epiweek) %>% 
  ACF(resid, lag_max = 52) %>% 
  autoplot()

## are residuals normally distributed (are we under- or over-estimating?)  
observed %>%
  ggplot(aes(x = resid)) +
  geom_histogram(binwidth = 100) +
  geom_rug() +
  labs(y = "count") 
## Warning: Removed 4 rows containing non-finite values (stat_bin).

## compare fitted values to their residuals 
  ## should also be no pattern 
observed %>%
  ggplot(aes(x = estimate, y = resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Residuals")
## Warning: Removed 4 rows containing missing values (geom_point).

## formally test autocorrelation of the residuals
## H0 is that residuals are from a white-noise series (i.e. random)
## test for independence 
## if p value significant then non-random
Box.test(observed$resid, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  observed$resid
## X-squared = 339.52, df = 1, p-value < 2.2e-16

23.7 Outbreak detection

We will demonstrate two (similar) methods of detecting outbreaks here. The first builds on the sections above. We use the trending package to fit regressions to previous years, and then predict what we expect to see in the following year. If observed counts are above what we expect, then it could suggest there is an outbreak. The second method is based on similar principles but uses the surveillance package, which has a number of different algorithms for aberration detection.

CAUTION: Normally, you are interested in the current year (where you only know counts up to the present week). So in this example we are pretending to be in week 39 of 2011.

surveillance package

In this section we use the surveillance package to create alert thresholds based on outbreak detection algorithms. There are several different methods available in the package; however, we will focus on two options here. For details, see these papers on the application and theory of the algorithms used.

The first option uses the improved Farrington method. This fits a negative binomial glm (including trend) and down-weights past outbreaks (outliers) to create a threshold level.

The second option uses the glrnb method. This also fits a negative binomial glm but includes trend and fourier terms (so is favoured here). The regression is used to calculate the “control mean” (~fitted values); it then uses a computed generalized likelihood ratio statistic to assess whether there is a shift in the mean for each week. Note that the threshold for each week takes into account previous weeks, so a sustained shift will trigger an alarm. (Also note that the algorithm is reset after each alarm.)

In order to work with the surveillance package, we first need to define a “surveillance time series” object (using the sts() function) to fit within the framework.

## define surveillance time series object
## nb. you can include a denominator with the population object (see ?sts)
counts_sts <- sts(observed = counts$case_int[!is.na(counts$case_int)],
                  start = c(
                    ## subset to only keep the year from start_date 
                    as.numeric(str_sub(start_date, 1, 4)), 
                    ## subset to only keep the week from start_date
                    as.numeric(str_sub(start_date, 7, 8))), 
                  ## define the type of data (in this case weekly)
                  freq = 52)

## define the week range that you want to include (ie. prediction period)
## nb. the sts object only counts observations without assigning a week or 
## year identifier to them - so we use our data to define the appropriate observations
weekrange <- cut_off - start_date

Farrington method

We then define each of our parameters for the Farrington method in a list. We run the algorithm using farringtonFlexible() and then extract the threshold for an alert using farringtonmethod@upperbound to include this in our dataset. It is also possible to extract a TRUE/FALSE for each week indicating whether it triggered an alert (was above the threshold) using farringtonmethod@alarm.

## define control
ctrl <- list(
  ## define the time period that you want a threshold for (i.e. 2011)
  range = which(counts_sts@epoch > weekrange),
  b = 9, ## how many years backwards for baseline
  w = 2, ## rolling window size in weeks
  weightsThreshold = 2.58, ## reweighting past outbreaks (improved noufaily method - original suggests 1)
  ## pastWeeksNotIncluded = 3, ## use all weeks available (noufaily suggests drop 26)
  trend = TRUE,
  pThresholdTrend = 1, ## 0.05 normally, however 1 is advised in the improved method (i.e. always keep)
  thresholdMethod = "nbPlugin",
  populationOffset = TRUE
  )

## apply farrington flexible method
farringtonmethod <- farringtonFlexible(counts_sts, ctrl)

## create a new variable in the original dataset called threshold
## containing the upper bound from farrington 
## nb. this is only for the weeks in 2011 (so need to subset rows)
counts[which(counts$epiweek >= cut_off & 
               !is.na(counts$case_int)),
              "threshold"] <- farringtonmethod@upperbound

We can then visualise the results in ggplot as done previously.

ggplot(counts, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in upper bound of aberration algorithm
  geom_line(aes(y = threshold, colour = "Alert threshold"), 
            linetype = "dashed", 
            size = 1.5) +
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Alert threshold" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic() + 
  ## remove title of legend 
  theme(legend.title = element_blank())

GLRNB method

Similarly, for the GLRNB method we define each of our parameters in a list, then fit the algorithm and extract the upper bounds.

CAUTION: This method uses “brute force” (similar to bootstrapping) for calculating thresholds, so can take a long time!

See the GLRNB vignette for details.

## define control options
ctrl <- list(
  ## define the time period that you want a threshold for (i.e. 2011)
  range = which(counts_sts@epoch > weekrange),
  mu0 = list(S = 1,          ## number of fourier terms (harmonics) to include
             trend = TRUE,   ## whether to include trend or not
             refit = FALSE), ## whether to refit model after each alarm
  ## c.ARL = threshold for GLR statistic (arbitrary)
     ## 3 ~ middle ground for minimising false positives
     ## 1 fits to the 99%PI of glm.nb - with changes after peaks (threshold lowered for alert)
  c.ARL = 2,
   # theta = log(1.5), ## equates to a 50% increase in cases in an outbreak
   ret = "cases"     ## return threshold upperbound as case counts
  )

## apply the glrnb method
glrnbmethod <- glrnb(counts_sts, control = ctrl, verbose = FALSE)

## create a new variable in the original dataset called threshold
## containing the upper bound from glrnb 
## nb. this is only for the weeks in 2011 (so need to subset rows)
counts[which(counts$epiweek >= cut_off & 
               !is.na(counts$case_int)),
              "threshold_glrnb"] <- glrnbmethod@upperbound

Visualise the outputs as previously.

ggplot(counts, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in upper bound of aberration algorithm
  geom_line(aes(y = threshold_glrnb, colour = "Alert threshold"), 
            linetype = "dashed", 
            size = 1.5) +
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Alert threshold" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic() + 
  ## remove title of legend 
  theme(legend.title = element_blank())

23.8 Interrupted timeseries

Interrupted time series analysis (also called segmented regression or intervention analysis) is often used to assess the impact of vaccines on the incidence of disease, but it can be used to assess the impact of a wide range of interventions or introductions, for example changes in hospital procedures or the introduction of a new disease strain to a population. In this example we will pretend that a new strain of Campylobacter was introduced to Germany at the end of 2008, and see if that affects the number of cases. We will use negative binomial regression again. This time the regression is split into two parts, one before the intervention (here, the introduction of the new strain) and one after (the pre- and post-periods). This allows us to calculate an incidence rate ratio comparing the two time periods. Explaining the equation might make this clearer (if not, then just ignore it!).

The negative binomial regression can be defined as follows:

\[\log(Y_t) = \beta_0 + \beta_1 \times t + \beta_2 \times \delta(t-t_0) + \beta_3 \times (t-t_0)^+ + \log(pop_t) + e_t\]

Where:

  • \(Y_t\) is the number of cases observed at time \(t\)
  • \(pop_t\) is the population size in 100,000s at time \(t\) (not used here)
  • \(t_0\) is the last year of the pre-period (including transition time if any)
  • \(\delta(x)\) is the indicator function (it is 0 if \(x \le 0\) and 1 if \(x > 0\))
  • \((x)^+\) is the cut-off operator (it is \(x\) if \(x > 0\) and 0 otherwise)
  • \(e_t\) denotes the residual

Additional terms for trend and season can be added as needed.

\(\beta_2 \times \delta(t-t_0) + \beta_3 \times (t-t_0)^+\) is the generalised linear part of the post-period and is zero in the pre-period. This means that the \(\beta_2\) and \(\beta_3\) estimates are the effects of the intervention.

We need to re-calculate the fourier terms without forecasting here, as we will use all the data available to us (i.e. retrospectively). Additionally we need to calculate the extra terms needed for the regression.

## add in fourier terms using the epiweek and case_int variables
counts$fourier <- select(counts, epiweek, case_int) %>% 
  as_tsibble(index = epiweek) %>% 
  fourier(K = 1)

## define intervention week 
intervention_week <- yearweek("2008-12-31")

## define variables for regression 
counts <- counts %>% 
  mutate(
    ## corresponds to t in the formula
      ## count of weeks (could probably also just use straight epiweeks var)
    # linear = row_number(epiweek), 
    ## corresponds to delta(t-t0) in the formula
      ## pre or post intervention period
    intervention = as.numeric(epiweek >= intervention_week), 
    ## corresponds to (t-t0)^+ in the formula
      ## count of weeks post intervention
      ## (choose the larger number between 0 and whatever comes from calculation)
    time_post = pmax(0, epiweek - intervention_week + 1))
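The two intervention variables are easier to see on a small example. This sketch uses a plain integer week index and an invented intervention week in place of the yearweek values used above.

```r
## ten weeks of follow-up, with the intervention starting in week 6 (hypothetical)
week              <- 1:10
intervention_week <- 6

## 0 in the pre-period, 1 from the intervention week onwards
intervention <- as.numeric(week >= intervention_week)

## 0 in the pre-period, then a running count of weeks since the intervention
time_post <- pmax(0, week - intervention_week + 1)

data.frame(week, intervention, time_post)
```

The intervention column captures a step change in level, while time_post captures a change in slope after the intervention.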

We then use these terms to fit a negative binomial regression, and produce a table with percentage change. What this example shows is that there was no significant change.

CAUTION: Note the use of simulate_pi = FALSE within predict(). This is because the default behaviour of trending is to use the ciTools package to estimate a prediction interval. This does not work if there are NA counts, and also produces more granular intervals. See ?trending::predict.trending_model_fit for details.

## define the model you want to fit (negative binomial) 
model <- glm_nb_model(
  ## set number of cases as outcome of interest
  case_int ~
    ## use epiweek to account for the trend
    epiweek +
    ## use the fourier terms to account for seasonality
    fourier + 
    ## add in whether in the pre- or post-period 
    intervention + 
    ## add in the time post intervention 
    time_post
    )

## fit your model using the counts dataset
fitted_model <- trending::fit(model, counts)

## calculate confidence intervals and prediction intervals 
observed <- predict(fitted_model, simulate_pi = FALSE)



## show estimates and percentage change in a table
fitted_model %>% 
  ## extract original negative binomial regression
  get_model() %>% 
  ## get a tidy dataframe of results
  tidy(exponentiate = TRUE, 
       conf.int = TRUE) %>% 
  ## only keep the intervention value 
  filter(term == "intervention") %>% 
  ## change the IRR to percentage change for estimate and CIs 
  mutate(
    ## for each of the columns of interest - create a new column
    across(
      all_of(c("estimate", "conf.low", "conf.high")), 
      ## apply the formula to calculate percentage change
            .f = function(i) 100 * (i - 1), 
      ## add a suffix to new column names with "_perc"
      .names = "{.col}_perc")
    ) %>% 
  ## only keep (and rename) certain columns 
  select("IRR" = estimate, 
         "95%CI low" = conf.low, 
         "95%CI high" = conf.high,
         "Percentage change" = estimate_perc, 
         "95%CI low (perc)" = conf.low_perc, 
         "95%CI high (perc)" = conf.high_perc,
         "p-value" = p.value)
## # A tibble: 1 x 7
##       IRR `95%CI low` `95%CI high` `Percentage change` `95%CI low (perc)` `95%CI high (perc)` `p-value`
##     <dbl>       <dbl>        <dbl>               <dbl>              <dbl>               <dbl>     <dbl>
## 1 -0.0661      -0.135      0.00305               -107.              -113.               -99.7    0.0645
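
The percentage change columns are obtained as 100 * (IRR - 1). As a quick sketch of that arithmetic, using made-up incidence rate ratios rather than values from the model above:

```r
## hypothetical incidence rate ratios (not model output)
irr <- c(0.80, 1.00, 1.25)

## convert each ratio to a percentage change
perc_change <- 100 * (irr - 1)

round(perc_change, 1)
```

An IRR below 1 gives a negative percentage change (a decrease), an IRR of exactly 1 gives no change, and an IRR above 1 gives an increase.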

As previously we can visualise the outputs of the regression.

ggplot(observed, aes(x = epiweek)) + 
  ## add in observed case counts as a line
  geom_line(aes(y = case_int, colour = "Observed")) + 
  ## add in a line for the model estimate
  geom_line(aes(y = estimate, col = "Estimate")) + 
  ## add in a band for the prediction intervals 
  geom_ribbon(aes(ymin = lower_pi, 
                  ymax = upper_pi), 
              alpha = 0.25) + 
  ## add vertical line and label to show where the intervention started
  geom_vline(
           xintercept = as.Date(intervention_week), 
           linetype = "dashed") + 
  annotate(geom = "text", 
           label = "Intervention", 
           x = intervention_week, 
           y = max(observed$upper_pi), 
           angle = 90, 
           vjust = 1
           ) + 
  ## define colours
  scale_colour_manual(values = c("Observed" = "black", 
                                 "Estimate" = "red")) + 
  ## make a traditional plot (with black axes and white background)
  theme_classic()
## Warning: Removed 13 row(s) containing missing values (geom_path).

24 Epidemic modeling

24.1 Overview

There exists a growing body of tools for epidemic modelling that lets us conduct fairly complex analyses with minimal effort. This section will provide an overview on how to use these tools to:

  • estimate the effective reproduction number Rt and related statistics such as the doubling time
  • produce short-term projections of future incidence

It is not intended as an overview of the methodologies and statistical methods underlying these tools, so please refer to the Resources tab for links to some papers covering this. Make sure you have an understanding of the methods before using these tools; this will ensure you can accurately interpret their results.

Below is an example of one of the outputs we’ll be producing in this section.

24.2 Preparation

We will use two different methods and packages for Rt estimation, namely EpiNow2 and EpiEstim, as well as the projections package for forecasting case incidence.

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
   rio,          # File import
   here,         # File locator
   tidyverse,    # Data management + ggplot2 graphics
   epicontacts,  # Analysing transmission networks
   EpiNow2,      # Rt estimation
   EpiEstim,     # Rt estimation
   projections,  # Incidence projections
   incidence2,   # Handling incidence data
   epitrix,      # Useful epi functions
   distcrete     # Discrete delay distributions
)

We will use the cleaned case linelist for all analyses in this section. If you want to follow along, click to download the “clean” linelist (as .rds file). See the Download handbook and data page to download all example data used in this handbook.

# import the cleaned linelist
linelist <- import("linelist_cleaned.rds")

24.3 Estimating Rt

EpiNow2 vs. EpiEstim

The reproduction number R is a measure of the transmissibility of a disease and is defined as the expected number of secondary cases per infected case. In a fully susceptible population, this value represents the basic reproduction number R0. However, as the number of susceptible individuals in a population changes over the course of an outbreak or pandemic, and as various response measures are implemented, the most commonly used measure of transmissibility is the effective reproduction number Rt; this is defined as the expected number of secondary cases per infected case at a given time t.

The EpiNow2 package provides the most sophisticated framework for estimating Rt. It has two key advantages over the other commonly used package, EpiEstim:

  • It accounts for delays in reporting and can therefore estimate Rt even when recent data is incomplete.
  • It estimates Rt on dates of infection rather than on dates of onset or reporting, which means that the effect of an intervention will be immediately reflected in a change in Rt, rather than with a delay.

However, it also has two key disadvantages:

  • It requires knowledge of the generation time distribution (i.e. distribution of delays between infection of a primary and a secondary case), incubation period distribution (i.e. distribution of delays between infection and symptom onset) and any further delay distribution relevant to your data (e.g. if you have dates of reporting, you require the distribution of delays from symptom onset to reporting). While this will allow more accurate estimation of Rt, EpiEstim only requires the serial interval distribution (i.e. the distribution of delays between symptom onset of a primary and a secondary case), which may be the only distribution available to you.
  • EpiNow2 is significantly slower than EpiEstim, anecdotally by a factor of about 100-1000! For example, estimating Rt for the sample outbreak considered in this section takes about four hours (this was run for a large number of iterations to ensure high accuracy and could probably be reduced if necessary, but the point stands that the algorithm is slow in general). This may be infeasible if you are regularly updating your Rt estimates.

Which package you choose to use will therefore depend on the data, time and computational resources available to you.

EpiNow2

Estimating delay distributions

The delay distributions required to run EpiNow2 depend on the data you have. Essentially, you need to be able to describe the delay from the date of infection to the date of the event you want to use to estimate Rt. If you are using dates of onset, this would simply be the incubation period distribution. If you are using dates of reporting, you require the delay from infection to reporting. As this distribution is unlikely to be known directly, EpiNow2 lets you chain multiple delay distributions together; in this case, the delay from infection to symptom onset (e.g. the incubation period, which is likely known) and from symptom onset to reporting (which you can often estimate from the data).

As we have the dates of onset for all our cases in the example linelist, we will only require the incubation period distribution to link our data (e.g. dates of symptom onset) to the date of infection. We can either estimate this distribution from the data or use values from the literature.

A literature estimate of the incubation period of Ebola (taken from this paper) with a mean of 9.1, standard deviation of 7.3 and maximum value of 30 would be specified as follows:

incubation_period_lit <- list(
  mean = log(9.1),
  mean_sd = log(0.1),
  sd = log(7.3),
  sd_sd = log(0.1),
  max = 30
)

Note that EpiNow2 requires these delay distributions to be provided on a log scale, hence the log call around each value (except the max parameter which, confusingly, has to be provided on a natural scale). The mean_sd and sd_sd define the standard deviation of the mean and standard deviation estimates. As these are not known in this case, we choose the fairly arbitrary value of 0.1.
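
For reference, if the published mean and standard deviation describe the delay on the natural scale and the delay is assumed to be lognormal, the matching log-scale parameters can be obtained by moment matching. The sketch below uses base R only. It illustrates that conversion under those assumptions and is not a statement of what any particular EpiNow2 version expects; note that it gives different numbers from wrapping each value in log().

```r
## literature values on the natural scale (from the text above)
nat_mean <- 9.1
nat_sd   <- 7.3

## moment matching for a lognormal distribution
sdlog   <- sqrt(log(1 + (nat_sd / nat_mean)^2))
meanlog <- log(nat_mean) - sdlog^2 / 2

## sanity check: transform back to the natural scale
back_mean <- exp(meanlog + sdlog^2 / 2)
back_sd   <- back_mean * sqrt(exp(sdlog^2) - 1)

c(meanlog = meanlog, sdlog = sdlog, back_mean = back_mean, back_sd = back_sd)
```

The back-transformed values should reproduce the original mean and standard deviation, which is a useful check before passing parameters to any model.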

In this analysis, we instead estimate the incubation period distribution from the linelist itself using the function bootstrapped_dist_fit, which will fit a lognormal distribution to the observed delays between infection and onset in the linelist.

## estimate incubation period
incubation_period <- bootstrapped_dist_fit(
  linelist$date_onset - linelist$date_infection,
  dist = "lognormal",
  max_value = 100,
  bootstraps = 1
)
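
To build intuition for what fitting a lognormal distribution to delays means, here is a simplified base R sketch on simulated delays. It is a stand-in for illustration only and does not reproduce how bootstrapped_dist_fit works internally; the simulated data and the true parameter values are made up.

```r
set.seed(1)

## simulate 5000 delays (in days) from a known lognormal distribution
delays <- rlnorm(5000, meanlog = 2, sdlog = 0.5)

## drop delays above an upper bound, similar in spirit to max_value
delays <- delays[delays > 0 & delays <= 100]

## maximum likelihood estimate of the log-scale mean, and the sample
## standard deviation of the log delays as an estimate of the log-scale sd
meanlog_hat <- mean(log(delays))
sdlog_hat   <- sd(log(delays))

c(meanlog_hat = meanlog_hat, sdlog_hat = sdlog_hat)
```

With this many observations the estimates should land close to the values used for simulation (2 and 0.5).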

The other distribution we require is the generation time. As we have data on infection times and transmission links, we can estimate this distribution from the linelist by calculating the delay between infection times of infector-infectee pairs. To do this, we use the handy get_pairwise function from the package epicontacts, which allows us to calculate pairwise differences of linelist properties between transmission pairs. We first create an epicontacts object (see Transmission chains page for further details):

## generate contacts
contacts <- linelist %>%
  transmute(
    from = infector,
    to = case_id
  ) %>%
  drop_na()

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts, 
  directed = TRUE
)

We then fit the difference in infection times between transmission pairs, calculated using get_pairwise, to a gamma distribution:

## estimate gamma generation time
generation_time <- bootstrapped_dist_fit(
  get_pairwise(epic, "date_infection"),
  dist = "gamma",
  max_value = 20,
  bootstraps = 1
)

Running EpiNow2

Now we just need to calculate daily incidence from the linelist, which we can do easily with the dplyr functions group_by() and n(). Note that EpiNow2 requires the column names to be date and confirm.

## get incidence from onset dates
cases <- linelist %>%
  group_by(date = date_onset) %>%
  summarise(confirm = n())
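
The same counting step can be tried on a toy linelist. This sketch uses made-up onset dates and additionally drops rows with a missing onset date before counting, which is our own assumption about how you may want to handle missing values rather than part of the code above.

```r
library(dplyr)

## toy linelist with one missing onset date (made-up data)
toy_linelist <- tibble(
  date_onset = as.Date(c("2014-05-01", "2014-05-01", "2014-05-03", NA))
)

## count cases per onset date, using the column names date and confirm
cases_toy <- toy_linelist %>%
  filter(!is.na(date_onset)) %>%
  group_by(date = date_onset) %>%
  summarise(confirm = n())

cases_toy
```

Note that dates without any cases (here 2 May) do not appear in the result at all, because only observed dates are counted.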

We can then estimate Rt using the epinow function. Some notes on the inputs:

  • We can provide any number of ‘chained’ delay distributions to the delays argument; we would simply insert them alongside the incubation_period object within the delay_opts function.
  • return_output ensures the output is returned within R and not just saved to a file.
  • verbose specifies that we want a readout of the progress.
  • horizon indicates how many days we want to project future incidence for.
  • We pass additional options to the stan argument to specify how long we want to run the inference for. Increasing samples and chains will give you a more accurate estimate that better characterises uncertainty, but will take longer to run.

## run epinow
epinow_res <- epinow(
  reported_cases = cases,
  generation_time = generation_time,
  delays = delay_opts(incubation_period),
  return_output = TRUE,
  verbose = TRUE,
  horizon = 21,
  stan = stan_opts(samples = 750, chains = 4)
)

Analysing outputs

Once the code has finished running, we can plot a summary very easily as follows. Scroll the image to see the full extent.

## plot summary figure
plot(epinow_res)

We can also look at various summary statistics:

## summary table
epinow_res$summary
##                                  measure                  estimate  numeric_estimate
## 1: New confirmed cases by infection date                4 (2 -- 6) <data.table[1x9]>
## 2:        Expected change in daily cases                    Unsure              0.56
## 3:            Effective reproduction no.        0.88 (0.73 -- 1.1) <data.table[1x9]>
## 4:                        Rate of growth -0.012 (-0.028 -- 0.0052) <data.table[1x9]>
## 5:          Doubling/halving time (days)          -60 (130 -- -25) <data.table[1x9]>

For further analyses and custom plotting, you can access the summarised daily estimates via $estimates$summarised. We will convert this from the default data.table to a tibble for ease of use with dplyr.

## extract summary and convert to tibble
estimates <- as_tibble(epinow_res$estimates$summarised)
estimates

As an example, let’s make a plot of the doubling time and Rt. We will only look at the first few months of the outbreak when Rt is well above one, to avoid plotting extremely high doubling times.

We use the formula log(2)/growth_rate to calculate the doubling time from the estimated growth rate.
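
As a small worked example of this formula, the sketch below converts a few illustrative growth rates (per day) to doubling times. The values are chosen for illustration only.

```r
## illustrative daily growth rates
growth_rate <- c(0.10, 0.05, -0.012)

## doubling time in days; negative values are halving times
doubling_time <- log(2) / growth_rate

round(doubling_time, 1)
```

A growth rate of 0.10 per day corresponds to cases doubling roughly every 7 days. A negative growth rate gives a negative result, which is read as a halving time, as in the "Doubling/halving time" row of the summary table. As the growth rate approaches zero the doubling time becomes extremely large, which is why the plot is restricted to the early months of the outbreak.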

## make wide df for median plotting
df_wide <- estimates %>%
  filter(
    variable %in% c("growth_rate", "R"),
    date < as.Date("2014-09-01")
  ) %>%
  ## convert growth rates to doubling times
  mutate(
    across(
      c(median, lower_90:upper_90),
      ~ case_when(
        variable == "growth_rate" ~ log(2)/.x,
        TRUE ~ .x
      )
    ),
    ## rename variable to reflect transformation
    variable = replace(variable, variable == "growth_rate", "doubling_time")
  )

## make long df for quantile plotting
df_long <- df_wide %>%
  ## here we pair up matching quantiles (e.g. lower_90 with upper_90)
  pivot_longer(
    lower_90:upper_90,
    names_to = c(".value", "quantile"),
    names_pattern = "(.+)_(.+)"
  )

## make plot
ggplot() +
  geom_ribbon(
    data = df_long,
    aes(x = date, ymin = lower, ymax = upper, alpha = quantile),
    color = NA
  ) +
  geom_line(
    data = df_wide,
    aes(x = date, y = median)
  ) +
  ## use label_parsed to allow subscript label
  facet_wrap(
    ~ variable,
    ncol = 1,
    scales = "free_y",
    labeller = as_labeller(c(R = "R[t]", doubling_time = "Doubling~time"), label_parsed),
    strip.position = 'left'
  ) +
  ## manually define quantile transparency
  scale_alpha_manual(
    values = c(`20` = 0.7, `50` = 0.4, `90` = 0.2),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    x = NULL,
    y = NULL,
    alpha = "Credible\ninterval"
  ) +
  scale_x_date(
    date_breaks = "1 month",
    date_labels = "%b %d\n%Y"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.background = element_blank(),
    strip.placement = 'outside'
  )

EpiEstim

To run EpiEstim, we need to provide data on daily incidence and specify the serial interval (i.e. the distribution of delays between symptom onset of primary and secondary cases).

Incidence data can be provided to EpiEstim as a vector, a data frame, or an incidence object from the original incidence package. You can even distinguish between imports and locally acquired infections; see the documentation at ?estimate_R for further details.

We will create the input using incidence2. See the page on Epidemic curves for more examples with the incidence2 package. Since there have been updates to the incidence2 package that don’t completely align with the input expected by estimate_R(), there are some minor additional steps needed. The incidence object consists of a tibble with dates and their respective case counts. We use complete() from tidyr to ensure all dates are included (even those with no cases), and then rename() the columns to align with what is expected by estimate_R() in a later step.

## get incidence from onset date
cases <- incidence2::incidence(linelist, date_index = date_onset) %>% # get case counts by day
  tidyr::complete(date_index = seq.Date(                              # ensure all dates are represented
    from = min(date_index, na.rm = TRUE),
    to = max(date_index, na.rm = TRUE),
    by = "day"),
    fill = list(count = 0)) %>%                                       # convert NA counts to 0
  rename(I = count,                                                   # rename to names expected by estimate_R()
         dates = date_index)
## 256 missing observations were removed.
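
To illustrate what complete() does here, the following sketch fills in missing days in a tiny made-up table of daily counts. The column names mirror those above, but the data are invented.

```r
library(tidyr)

## toy daily counts with a two-day gap (made-up data)
toy_counts <- data.frame(
  date_index = as.Date(c("2014-05-01", "2014-05-04")),
  count      = c(2L, 1L)
)

## add a row for every day in the range, filling missing counts with 0
toy_filled <- complete(
  toy_counts,
  date_index = seq.Date(min(date_index), max(date_index), by = "day"),
  fill = list(count = 0L)
)

toy_filled
```

The result has one row per day from 1 May to 4 May, with zero counts on the two days that were absent from the input.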

The package provides several options for specifying the serial interval, the details of which are provided in the documentation at ?estimate_R. We will cover two of them here.

Using serial interval estimates from the literature

Using the option method = "parametric_si", we can manually specify the mean and standard deviation of the serial interval in a config object created using the function make_config. We use a mean and standard deviation of 12.0 and 5.2, respectively, defined in this paper:

## make config
config_lit <- make_config(
  mean_si = 12.0,
  std_si = 5.2
)

We can then estimate Rt with the estimate_R function:

epiestim_res_lit <- estimate_R(
  incid = cases,
  method = "parametric_si",
  config = config_lit
)
## Default config will estimate R on weekly sliding windows.
##     To change this change the t_start and t_end arguments.

and plot a summary of the outputs:

plot(epiestim_res_lit)

Using serial interval estimates from the data

As we have data on dates of symptom onset and transmission links, we can also estimate the serial interval from the linelist by calculating the delay between onset dates of infector-infectee pairs. As we did in the EpiNow2 section, we will use the get_pairwise function from the epicontacts package, which allows us to calculate pairwise differences of linelist properties between transmission pairs. We first create an epicontacts object (see Transmission chains page for further details):

## generate contacts
contacts <- linelist %>%
  transmute(
    from = infector,
    to = case_id
  ) %>%
  drop_na()

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts, 
  directed = TRUE
)

We then fit the difference in onset dates between transmission pairs, calculated using get_pairwise, to a gamma distribution. We use the handy fit_disc_gamma from the epitrix package for this fitting procedure, as we require a discretised distribution.

## estimate gamma serial interval
serial_interval <- fit_disc_gamma(get_pairwise(epic, "date_onset"))

We then pass this information to the config object, run EpiEstim again and plot the results:

## make config
config_emp <- make_config(
  mean_si = serial_interval$mu,
  std_si = serial_interval$sd
)

## run epiestim
epiestim_res_emp <- estimate_R(
  incid = cases,
  method = "parametric_si",
  config = config_emp
)
## Default config will estimate R on weekly sliding windows.
##     To change this change the t_start and t_end arguments.
## plot outputs
plot(epiestim_res_emp)

Specifying estimation time windows

The default options provide a weekly sliding estimate and may trigger a warning that you are estimating Rt too early in the outbreak for a precise estimate. You can change this by setting a later start date for the estimation, as shown below. Unfortunately, EpiEstim only provides a rather clunky way of specifying these estimation windows: you have to provide vectors of integers referring to the start and end dates of each time window.

## define a vector of dates starting on June 1st
start_dates <- seq.Date(
  as.Date("2014-06-01"),
  max(cases$dates) - 7,
  by = 1
) %>%
  ## subtract the starting date to convert to numeric
  `-`(min(cases$dates)) %>%
  ## convert to integer
  as.integer()

## add six days for a one week sliding window
end_dates <- start_dates + 6
  
## make config
config_partial <- make_config(
  mean_si = 12.0,
  std_si = 5.2,
  t_start = start_dates,
  t_end = end_dates
)
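
An alternative way to derive the window indices is to look up the position of the desired start date in the vector of dates with match(). The sketch below uses a made-up 30-day date vector and assumes that t_start and t_end are 1-based positions in the incidence series; check ?make_config to confirm this for your version of EpiEstim.

```r
## toy vector of 30 consecutive dates (made-up)
dates <- seq.Date(as.Date("2014-05-01"), by = "day", length.out = 30)

## position of the first date we want to estimate from
first_start <- match(as.Date("2014-05-10"), dates)

## one-week sliding windows: each window covers 7 days
t_start <- seq(first_start, length(dates) - 6)
t_end   <- t_start + 6

head(data.frame(t_start, t_end, from = dates[t_start], to = dates[t_end]))
```

Printing the corresponding dates next to the indices, as in the last line, is a simple way to check that the windows cover the period you intended.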

Now we re-run EpiEstim and can see that the estimates only start from June:

## run epiestim
epiestim_res_partial <- estimate_R(
  incid = cases,
  method = "parametric_si",
  config = config_partial
)

## plot outputs
plot(epiestim_res_partial)
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead.

Analysing outputs

The main outputs can be accessed via $R. As an example, we will create a plot of Rt and a measure of “transmission potential” given by the product of Rt and the number of cases reported on that day; this represents the expected number of cases in the next generation of infection.
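
Before working through the full data manipulation, here is the core calculation on a toy table. The Rt values and case counts are made up, and the column names median_r and I are used only to mirror the code that follows.

```r
## toy Rt estimates and daily case counts (made-up values)
toy <- data.frame(
  dates    = as.Date(c("2014-06-01", "2014-06-02", "2014-06-03")),
  median_r = c(1.8, 1.2, 0.9),
  I        = c(10, 25, 40)
)

## transmission potential: Rt multiplied by the cases reported that day
toy$median_risk <- toy$median_r * toy$I

toy
```

Note how the third day has the lowest Rt but the highest transmission potential, because many more cases were reported on that day.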

## make wide dataframe for median
df_wide <- epiestim_res_lit$R %>%
  rename_all(clean_labels) %>%
  rename(
    lower_95_r = quantile_0_025_r,
    lower_90_r = quantile_0_05_r,
    lower_50_r = quantile_0_25_r,
    upper_50_r = quantile_0_75_r,
    upper_90_r = quantile_0_95_r,
    upper_95_r = quantile_0_975_r,
    ) %>%
  mutate(
    ## extract the median date from t_start and t_end
    dates = epiestim_res_lit$dates[round(map2_dbl(t_start, t_end, median))],
    var = "R[t]"
  ) %>%
  ## merge in daily incidence data
  left_join(cases, "dates") %>%
  ## calculate risk across all r estimates
  mutate(
    across(
      lower_95_r:upper_95_r,
      ~ .x*I,
      .names = "{str_replace(.col, '_r', '_risk')}"
    )
  ) %>%
  ## separate r estimates and risk estimates
  pivot_longer(
    contains("median"),
    names_to = c(".value", "variable"),
    names_pattern = "(.+)_(.+)"
  ) %>%
  ## assign factor levels
  mutate(variable = factor(variable, c("risk", "r")))

## make long dataframe from quantiles
df_long <- df_wide %>%
  select(-variable, -median) %>%
  ## separate r/risk estimates and quantile levels
  pivot_longer(
    contains(c("lower", "upper")),
    names_to = c(".value", "quantile", "variable"),
    names_pattern = "(.+)_(.+)_(.+)"
  ) %>%
  mutate(variable = factor(variable, c("risk", "r")))

## make plot
ggplot() +
  geom_ribbon(
    data = df_long,
    aes(x = dates, ymin = lower, ymax = upper, alpha = quantile),
    color = NA
  ) +
  geom_line(
    data = df_wide,
    aes(x = dates, y = median),
    alpha = 0.2
  ) +
  ## use label_parsed to allow subscript label
  facet_wrap(
    ~ variable,
    ncol = 1,
    scales = "free_y",
    labeller = as_labeller(c(r = "R[t]", risk = "Transmission~potential"), label_parsed),
    strip.position = 'left'
  ) +
  ## manually define quantile transparency
  scale_alpha_manual(
    values = c(`50` = 0.7, `90` = 0.4, `95` = 0.2),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    x = NULL,
    y = NULL,
    alpha = "Credible\ninterval"
  ) +
  scale_x_date(
    date_breaks = "1 month",
    date_labels = "%b %d\n%Y"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    strip.background = element_blank(),
    strip.placement = 'outside'
  )

24.4 Projecting incidence

EpiNow2

Besides estimating Rt, EpiNow2 also supports forecasting of Rt and projections of case numbers by integration with the EpiSoon package under the hood. All you need to do is specify the horizon argument in your epinow function call, indicating how many days you want to project into the future; see the EpiNow2 section under “Estimating Rt” for details on how to get EpiNow2 up and running. In this section, we will just plot the outputs from that analysis, stored in the epinow_res object.

## define minimum date for plot
min_date <- as.Date("2015-03-01")

## extract summarised estimates
estimates <-  as_tibble(epinow_res$estimates$summarised)

## extract raw data on case incidence
observations <- as_tibble(epinow_res$estimates$observations) %>%
  filter(date > min_date)

## extract forecasted estimates of case numbers
df_wide <- estimates %>%
  filter(
    variable == "reported_cases",
    type == "forecast",
    date > min_date
  )

## convert to even longer format for quantile plotting
df_long <- df_wide %>%
  ## here we pair up matching quantiles (e.g. lower_90 with upper_90)
  pivot_longer(
    lower_90:upper_90,
    names_to = c(".value", "quantile"),
    names_pattern = "(.+)_(.+)"
  )

## make plot
ggplot() +
  ## bars of observed daily counts (geom_col plots the y values as given)
  geom_col(
    data = observations,
    aes(x = date, y = confirm)
  ) +
  geom_ribbon(
    data = df_long,
    aes(x = date, ymin = lower, ymax = upper, alpha = quantile),
    color = NA
  ) +
  geom_line(
    data = df_wide,
    aes(x = date, y = median)
  ) +
  geom_vline(xintercept = min(df_long$date), linetype = 2) +
  ## manually define quantile transparency
  scale_alpha_manual(
    values = c(`20` = 0.7, `50` = 0.4, `90` = 0.2),
    labels = function(x) paste0(x, "%")
  ) +
  labs(
    x = NULL,
    y = "Daily reported cases",
    alpha = "Credible\ninterval"
  ) +
  scale_x_date(
    date_breaks = "1 month",
    date_labels = "%b %d\n%Y"
  ) +
  theme_minimal(base_size = 14)

projections

The projections package developed by RECON makes it very easy to make short term incidence forecasts, requiring only knowledge of the effective reproduction number Rt and the serial interval. Here we will cover how to use serial interval estimates from the literature and how to use our own estimates from the linelist.

Using serial interval estimates from the literature

projections requires a discretised serial interval distribution of the class distcrete from the package distcrete. We will use a gamma distribution with a mean of 12.0 and a standard deviation of 5.2 defined in this paper. To convert these values into the shape and scale parameters required for a gamma distribution, we will use the function gamma_mucv2shapescale from the epitrix package.

## get shape and scale parameters from the mean mu and the coefficient of
## variation (i.e. the ratio of the standard deviation to the mean)
shapescale <- epitrix::gamma_mucv2shapescale(mu = 12.0, cv = 5.2/12)

## make distcrete object
serial_interval_lit <- distcrete::distcrete(
  name = "gamma",
  interval = 1,
  shape = shapescale$shape,
  scale = shapescale$scale
)
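
The conversion and discretisation can also be sketched in base R to make the steps explicit. This is an illustration only: the discretisation below assigns to each whole day the probability of the interval from that day to the next, which is one simple convention and is not guaranteed to match the weighting used by distcrete.

```r
## mean and standard deviation of the serial interval (from the text above)
mu    <- 12.0
sigma <- 5.2

## gamma parameters from the mean and coefficient of variation
cv    <- sigma / mu
shape <- 1 / cv^2
scale <- mu * cv^2

## check: these recover the mean and standard deviation
c(mean = shape * scale, sd = sqrt(shape) * scale)

## discretise to whole days: P(k <= X < k + 1) for k = 0, 1, ..., 100
days <- 0:100
pmf  <- pgamma(days + 1, shape = shape, scale = scale) -
  pgamma(days, shape = shape, scale = scale)

sum(pmf)
```

The probabilities should sum to very nearly one, confirming that truncating at 100 days loses essentially no mass for this distribution.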

Here is a quick check to make sure the serial interval looks correct. We access the density of the gamma distribution we have just defined by $d, which is equivalent to calling dgamma:

## check to make sure the serial interval looks correct
qplot(
  x = 0:50, y = serial_interval_lit$d(0:50), geom = "area",
  xlab = "Serial interval", ylab = "Density"
)

Using serial interval estimates from the data

As we have data on dates of symptom onset and transmission links, we can also estimate the serial interval from the linelist by calculating the delay between onset dates of infector-infectee pairs. As we did in the EpiNow2 section, we will use the get_pairwise function from the epicontacts package, which allows us to calculate pairwise differences of linelist properties between transmission pairs. We first create an epicontacts object (see Transmission chains page for further details):

## generate contacts
contacts <- linelist %>%
  transmute(
    from = infector,
    to = case_id
  ) %>%
  drop_na()

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts, 
  directed = TRUE
)

We then fit the difference in onset dates between transmission pairs, calculated using get_pairwise, to a gamma distribution. We use the handy fit_disc_gamma from the epitrix package for this fitting procedure, as we require a discretised distribution.

## estimate gamma serial interval
serial_interval <- fit_disc_gamma(get_pairwise(epic, "date_onset"))

## inspect estimate
serial_interval[c("mu", "sd")]
## $mu
## [1] 11.51242
## 
## $sd
## [1] 7.700005

Projecting incidence

To project future incidence, we still need to provide historical incidence in the form of an incidence object, as well as a sample of plausible Rt values. We will generate these values using the Rt estimates generated by EpiEstim in the previous section (under “Estimating Rt”) and stored in the epiestim_res_emp object. In the code below, we extract the mean and standard deviation estimates of Rt for the last time window of the outbreak (using the tail function to access the last element in a vector), and simulate 1000 values from a gamma distribution using rgamma. You can also provide your own vector of Rt values that you want to use for forward projections.

## create incidence object from dates of onset
inc <- incidence::incidence(linelist$date_onset)
## 256 missing observations were removed.
## extract plausible r values from most recent estimate
mean_r <- tail(epiestim_res_emp$R$`Mean(R)`, 1)
sd_r <- tail(epiestim_res_emp$R$`Std(R)`, 1)
shapescale <- gamma_mucv2shapescale(mu = mean_r, cv = sd_r/mean_r)
plausible_r <- rgamma(1000, shape = shapescale$shape, scale = shapescale$scale)

## check distribution
qplot(x = plausible_r, geom = "histogram", xlab = expression(R[t]), ylab = "Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We then use the project() function to make the actual forecast. We specify how many days we want to project for via the n_days argument, and specify the number of simulations using the n_sim argument.

## make projection
proj <- project(
  x = inc,
  R = plausible_r,
  si = serial_interval$distribution,
  n_days = 21,
  n_sim = 1000
)
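
To give a feel for the kind of simulation behind such a projection, here is a heavily simplified base R sketch of a renewal process: each day's expected cases are R times the recent cases weighted by the serial interval, and new cases are drawn from a Poisson distribution. All inputs are made up, project_once is a hypothetical helper written for this illustration, and this is not the implementation used by the projections package.

```r
set.seed(42)

## assumed inputs (all made up for illustration)
past_inc  <- c(3, 5, 4, 6, 8, 7, 9)      # recent daily counts, oldest first
si_pmf    <- c(0.1, 0.3, 0.3, 0.2, 0.1)  # P(serial interval = 1, ..., 5 days)
n_days    <- 14
n_sim     <- 200
r_samples <- rgamma(n_sim, shape = 144, scale = 1.2 / 144)  # mean 1.2, sd 0.1

## hypothetical helper: simulate one trajectory for a single value of R
project_once <- function(past, r, w, n_days) {
  inc <- past
  for (d in seq_len(n_days)) {
    n    <- length(inc)
    lags <- seq_len(min(length(w), n))
    ## expected new cases: past cases weighted by the serial interval
    lambda <- r * sum(inc[n - lags + 1] * w[lags])
    inc <- c(inc, rpois(1, lambda))
  }
  tail(inc, n_days)
}

## one simulation per sampled R value; rows are days, columns are simulations
sims <- sapply(r_samples, function(r) project_once(past_inc, r, si_pmf, n_days))

dim(sims)
apply(sims, 1, median)
```

Here each simulation keeps a single R value for the whole projection period, which is a simplifying assumption of this sketch.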

We can then handily plot the incidence and projections using the plot() and add_projections() functions. We can easily subset the incidence object to only show the most recent cases by using the square bracket operator.

## plot incidence and projections
plot(inc[inc$dates > as.Date("2015-03-01")]) %>%
  add_projections(proj)

You can also easily extract the raw estimates of daily case numbers by converting the output to a dataframe.

## convert to data frame for raw data
proj_df <- as.data.frame(proj)
proj_df

24.5 Resources

25 Contact tracing

This page demonstrates descriptive analysis of contact tracing data, addressing some key considerations and approaches unique to these kinds of data.

This page references many of the core R data management and visualisation competencies covered in other pages (e.g. data cleaning, pivoting, tables, time-series analyses), but we will highlight examples specific to contact tracing that have been useful for operational decision making. For example, this includes visualizing contact tracing follow-up data over time or across geographic areas, or producing clean Key Performance Indicator (KPI) tables for contact tracing supervisors.

For demonstration purposes we will use sample contact tracing data from the Go.Data platform. The principles covered here will apply for contact tracing data from other platforms - you may just need to undergo different data pre-processing steps depending on the structure of your data.

You can read more about the Go.Data project on the Github Documentation site or Community of Practice.

25.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,          # importing data  
  here,         # relative file pathways  
  janitor,      # data cleaning and tables
  lubridate,    # working with dates
  epikit,       # age_categories() function
  apyramid,     # age pyramids
  tidyverse,    # data manipulation and visualization
  RColorBrewer, # color palettes
  formattable,  # fancy tables
  kableExtra    # table formatting
)

Import data

We will import sample datasets of contacts, and of their “follow-up”. These data have been retrieved and un-nested from the Go.Data API and stored as “.rds” files.

You can download all the example data for this handbook from the Download handbook and data page.

If you want to download the example contact tracing data specific to this page, use the three download links below:

Click to download the case investigation data (.rds file)

Click to download the contact registration data (.rds file)

Click to download the contact follow-up data (.rds file)

In their original form in the downloadable files, the data reflect data as provided by the Go.Data API (learn about APIs here). For example purposes here, we will clean the data to make it easier to read on this page. If you are using a Go.Data instance, you can view complete instructions on how to retrieve your data here.

Below, the datasets are imported using the import() function from the rio package. See the page on Import and export for various ways to import data. We use here() to specify the file path - you should provide the file path specific to your computer. We then use select() to select only certain columns of the data, to simplify for purposes of demonstration.

Case data

These data are a table of the cases, and information about them.

cases <- import(here("data", "godata", "cases_clean.rds")) %>% 
  select(case_id, firstName, lastName, gender, age, age_class,
         occupation, classification, was_contact, hospitalization_typeid)

Here are the nrow(cases) cases:

Contacts data

These data are a table of all the contacts and information about them. Again, provide your own file path. After importing we perform a few preliminary data cleaning steps including:

  • Set age_class as a factor and reverse the level order so that younger ages are first
  • Select only certain columns, while renaming one of them
  • Artificially assign rows with missing admin level 2 to “Djembe”, to improve clarity of some example visualisations
contacts <- import(here("data", "godata", "contacts_clean.rds")) %>% 
  mutate(age_class = forcats::fct_rev(age_class)) %>% 
  select(contact_id, contact_status, firstName, lastName, gender, age,
         age_class, occupation, date_of_reporting, date_of_data_entry,
         date_of_last_exposure = date_of_last_contact,
         date_of_followup_start, date_of_followup_end, risk_level, was_case, admin_2_name) %>% 
  mutate(admin_2_name = replace_na(admin_2_name, "Djembe"))

Here are the nrow(contacts) rows of the contacts dataset:

Follow-up data

These data are records of the “follow-up” interactions with the contacts. Each contact is supposed to have an encounter each day for 14 days after their exposure.

We import and perform a few cleaning steps. We select certain columns, and also convert a character column to all lowercase values.

followups <- rio::import(here::here("data", "godata", "followups_clean.rds")) %>% 
  select(contact_id, followup_status, followup_number,
         date_of_followup, admin_2_name, admin_1_name) %>% 
  mutate(followup_status = str_to_lower(followup_status))

Here are the first 50 rows of the nrow(followups)-row followups dataset (each row is a follow-up interaction, with outcome status in the followup_status column):

Relationships data

Here we import data showing the relationship between cases and contacts. We select certain columns to show.

relationships <- rio::import(here::here("data", "godata", "relationships_clean.rds")) %>% 
  select(source_visualid, source_gender, source_age, date_of_last_contact,
         date_of_data_entry, target_visualid, target_gender,
         target_age, exposure_type)

Below are the first 50 rows of the relationships dataset, which records all relationships between cases and contacts.

25.2 Descriptive analyses

You can use the techniques covered in other pages of this handbook to conduct descriptive analyses of your cases, contacts, and their relationships. Below are some examples.

Demographics

As demonstrated in the page covering Demographic pyramids, you can visualise the age and gender distribution (here we use the apyramid package).

Age and Gender of contacts

The pyramid below compares the age distribution of contacts, by gender. Note that contacts missing age are included in their own bar at the top. You can change this default behavior, but then consider listing the number missing in a caption.

apyramid::age_pyramid(
  data = contacts,                                   # use contacts dataset
  age_group = "age_class",                           # categorical age column
  split_by = "gender") +                             # gender for halves of pyramid
  labs(
    fill = "Gender",                                 # title of legend
    title = "Age/Sex Pyramid of COVID-19 contacts")+ # title of the plot
  theme_minimal()                                    # simple background

With the Go.Data data structure, the relationships dataset contains the ages of both cases and contacts, so you could use that dataset to create an age pyramid showing the differences between these two groups of people. The relationships data frame will be mutated to transform the numeric age columns into categories (see the Cleaning data and core functions page). We also pivot the data frame longer to facilitate easy plotting with ggplot2 (see Pivoting data).

relation_age <- relationships %>% 
  select(source_age, target_age) %>% 
  transmute(                              # transmute is like mutate() but removes all other columns not mentioned
    source_age_class = epikit::age_categories(source_age, breakers = seq(0, 80, 5)),
    target_age_class = epikit::age_categories(target_age, breakers = seq(0, 80, 5)),
    ) %>% 
  pivot_longer(cols = contains("class"), names_to = "category", values_to = "age_class")  # pivot longer


relation_age
## # A tibble: 200 x 2
##    category         age_class
##    <chr>            <fct>    
##  1 source_age_class 80+      
##  2 target_age_class 15-19    
##  3 source_age_class <NA>     
##  4 target_age_class 50-54    
##  5 source_age_class <NA>     
##  6 target_age_class 20-24    
##  7 source_age_class 30-34    
##  8 target_age_class 45-49    
##  9 source_age_class 40-44    
## 10 target_age_class 30-34    
## # ... with 190 more rows

Now we can plot this transformed dataset with age_pyramid() as before, but replacing gender with category (contact, or case).

apyramid::age_pyramid(
  data = relation_age,                               # use modified relationship dataset
  age_group = "age_class",                           # categorical age column
  split_by = "category") +                           # by cases and contacts
  scale_fill_manual(
    values = c("orange", "purple"),                  # to specify colors AND labels
    labels = c("Case", "Contact"))+
  labs(
    fill = "Legend",                                           # title of legend
    title = "Age pyramid of COVID-19 contacts and cases")+     # title of the plot
  theme_minimal()                                              # simple background

We can also view other characteristics such as occupational breakdown (e.g. in form of a pie chart).

# Clean dataset and get counts by occupation
occ_plot_data <- cases %>% 
  mutate(occupation = forcats::fct_explicit_na(occupation),  # make NA missing values a category
         occupation = forcats::fct_infreq(occupation)) %>%   # order factor levels in order of frequency
  count(occupation)                                          # get counts by occupation
  
# Make pie chart
ggplot(data = occ_plot_data, mapping = aes(x = "", y = n, fill = occupation))+
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  labs(
    fill = "Occupation",
    title = "Known occupations of COVID-19 cases")+
  theme_minimal() +                    
  theme(axis.line = element_blank(),
        axis.title = element_blank(),
        axis.text = element_blank())

Contacts per case

The number of contacts per case can be an important metric to assess the quality of contact enumeration and the compliance of the population with the public health response.

Depending on your data structure, this can be assessed with a dataset that contains all cases and contacts. In the Go.Data datasets, the links between cases (“sources”) and contacts (“targets”) are stored in the relationships dataset.

In this dataset, each row is a contact, and the source case is listed in the row. There are no contacts who have relationships with multiple cases, but if such contacts exist in your data you may need to account for them before plotting (and explore them too!).
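One way to check for contacts linked to more than one source case is sketched below. It uses a small made-up data frame (relationships_demo is hypothetical, not the handbook dataset) that borrows the source_visualid and target_visualid column names from the relationships data.

```r
library(dplyr)

# Hypothetical toy relationships data (not the handbook dataset)
relationships_demo <- tibble::tibble(
  source_visualid = c("CASE-A", "CASE-A", "CASE-B", "CASE-B"),
  target_visualid = c("CONTACT-1", "CONTACT-2", "CONTACT-2", "CONTACT-3")
)

# Contacts that appear under more than one distinct source case
multi_source_contacts <- relationships_demo %>% 
  distinct(source_visualid, target_visualid) %>%   # ignore repeated source-target pairs
  count(target_visualid, name = "n_sources") %>%   # number of source cases per contact
  filter(n_sources > 1)                            # keep contacts with several sources

multi_source_contacts
```

If this returns any rows, you would need to decide whether such contacts should be counted once per source case or only once overall.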

We begin by counting the number of rows (contacts) per source case. This is saved as a data frame.

contacts_per_case <- relationships %>% 
  count(source_visualid)

contacts_per_case
## # A tibble: 23 x 2
##    source_visualid     n
##    <chr>           <int>
##  1 CASE-2020-0001     13
##  2 CASE-2020-0002      5
##  3 CASE-2020-0003      2
##  4 CASE-2020-0004      4
##  5 CASE-2020-0005      5
##  6 CASE-2020-0006      3
##  7 CASE-2020-0008      3
##  8 CASE-2020-0009      3
##  9 CASE-2020-0010      3
## 10 CASE-2020-0012      3
## # ... with 13 more rows

We use geom_histogram() to plot these data as a histogram.

ggplot(data = contacts_per_case)+        # begin with count data frame created above
  geom_histogram(mapping = aes(x = n))+  # print histogram of number of contacts per case
  scale_y_continuous(expand = c(0,0))+   # remove excess space below 0 on y-axis
  theme_light()+                         # simplify background
  labs(
    title = "Number of contacts per case",
    y = "Cases",
    x = "Contacts per case"
  )

25.3 Contact Follow Up

Contact tracing data often contain “follow-up” data, which record the outcomes of daily symptom checks of persons in quarantine. Analysis of these data can inform response strategy and identify contacts at risk of loss to follow-up or at risk of developing disease.

Data cleaning

These data can exist in a variety of formats. They may exist as a “wide” format Excel sheet with one row per contact, and one column per follow-up “day”. See Pivoting data for descriptions of “long” and “wide” data and how to pivot data wider or longer.
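As a rough illustration of converting such a “wide” sheet to the “long” format used on this page, here is a minimal sketch. The data frame followups_wide and its day_1 to day_3 columns are invented for the example; your own column names will differ.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format follow-up sheet: one row per contact, one column per day
followups_wide <- tibble::tibble(
  contact_id = c("C1", "C2"),
  day_1 = c("seen_ok", "missed"),
  day_2 = c("seen_ok", "seen_not_ok"),
  day_3 = c(NA, "seen_ok")
)

followups_long <- followups_wide %>% 
  pivot_longer(
    cols = starts_with("day_"),                            # the per-day status columns
    names_to = "followup_number",                          # former column names become the day number
    names_prefix = "day_",                                 # strip the "day_" prefix
    names_transform = list(followup_number = as.integer),  # store the day number as an integer
    values_to = "followup_status",                         # cell values become the status
    values_drop_na = TRUE)                                 # drop days with no recorded encounter

followups_long
```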

In our Go.Data example, these data are stored in the followups data frame, which is in a “long” format with one row per follow-up interaction. The first 50 rows look like this:

CAUTION: Beware of duplicates when dealing with follow-up data, as there could be several erroneous follow-ups on the same day for a given contact. Alternatively, what looks like an error may reflect reality: for example, a contact tracer could submit a follow-up form early in the day when they could not reach the contact, and submit a second form when the contact was later reached. How you handle duplicates will depend on the operational context; just make sure to document your approach clearly.

Let’s see how many instances of “duplicate” rows we have:

followups %>% 
  count(contact_id, date_of_followup) %>%   # get unique contact_days
  filter(n > 1)                             # view records where count is more than 1  
## # A tibble: 3 x 3
##   contact_id date_of_followup     n
##   <chr>      <date>           <int>
## 1 <NA>       2020-09-03           2
## 2 <NA>       2020-09-04           2
## 3 <NA>       2020-09-05           2

In our example data, the only records that this applies to are ones missing an ID! We can remove those. But for purposes of demonstration, we will show the steps for de-duplication so that there is only one follow-up encounter per person per day. See the page on De-duplication for more detail. We will assume that the most recent encounter record is the correct one. Note that the columns selected above do not include a time of data entry, so records within the same contact-day cannot be ordered by recency here, and the code below simply keeps the first row of each contact-day. We also take the opportunity to clean the followup_number column (the “day” of follow-up, which should range from 1 to 14).

followups_clean <- followups %>%
  
  # De-duplicate
  group_by(contact_id, date_of_followup) %>%        # group rows per contact-day
  arrange(contact_id, desc(date_of_followup)) %>%   # sort rows by contact, most recent follow-up date at top
  slice_head() %>%                                  # keep only the first row per contact-day (the grouping above)
  ungroup() %>% 
  
  # Other cleaning
  mutate(followup_number = replace(followup_number, followup_number > 14, NA)) %>% # clean erroneous data
  drop_na(contact_id)                               # remove rows with missing contact_id
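If your own data include a time of data entry, the “most recent record wins” rule can be applied explicitly. The sketch below assumes a timestamp column named created_at, which is hypothetical and not part of the example dataset, and uses invented toy rows.

```r
library(dplyr)

# Hypothetical toy data; `created_at` is an assumed timestamp column
followups_demo <- tibble::tibble(
  contact_id       = c("C1", "C1", "C2"),
  date_of_followup = as.Date(c("2020-09-03", "2020-09-03", "2020-09-03")),
  followup_status  = c("not_performed", "seen_ok", "seen_ok"),
  created_at       = as.POSIXct(c("2020-09-03 08:00:00", "2020-09-03 16:30:00",
                                  "2020-09-03 10:00:00"), tz = "UTC")
)

followups_dedup <- followups_demo %>% 
  group_by(contact_id, date_of_followup) %>%             # one group per contact-day
  slice_max(created_at, n = 1, with_ties = FALSE) %>%    # keep the latest-entered record
  ungroup()

followups_dedup
```

Here contact C1 keeps the afternoon “seen_ok” record rather than the morning “not_performed” one.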

For each follow-up encounter, we have a follow-up status (such as whether the encounter occurred and if so, did the contact have symptoms or not). To see all the values we can run a quick tabyl() (from janitor) or table() (from base R) (see Descriptive tables) by followup_status to see the frequency of each of the outcomes.

In this dataset, “seen_not_ok” means “seen with symptoms”, and “seen_ok” means “seen without symptoms”.

followups_clean %>% 
  tabyl(followup_status)
##  followup_status   n    percent
##           missed  10 0.02325581
##    not_attempted   5 0.01162791
##    not_performed 319 0.74186047
##      seen_not_ok   6 0.01395349
##          seen_ok  90 0.20930233

Plot over time

As the date data are continuous, we will use a histogram to plot them, with date_of_followup assigned to the x-axis. We can achieve a “stacked” histogram by specifying a fill = argument within aes(), which we assign to the column followup_status. Because the legend corresponds to the fill aesthetic, you can set the legend title using the fill = argument of labs().

We can see that the contacts were identified in waves (presumably corresponding with epidemic waves of cases), and that follow-up completion did not seem to improve over the course of the epidemic.

ggplot(data = followups_clean)+
  geom_histogram(mapping = aes(x = date_of_followup, fill = followup_status)) +
  scale_fill_discrete(drop = FALSE)+   # show all factor levels (followup_status) in the legend, even those not used
  theme_classic() +
  labs(
    x = "",
    y = "Number of contacts",
    title = "Daily Contact Followup Status",
    fill = "Followup Status",
    subtitle = str_glue("Data as of {max(followups_clean$date_of_followup, na.rm=T)}"))   # dynamic subtitle

CAUTION: If you are preparing many plots (e.g. for multiple jurisdictions) you will want the legends to appear identically even with varying levels of data completion or data composition. There may be plots for which not all follow-up statuses are present in the data, but you still want those categories to appear in the legends. In ggplots (like above), you can specify the drop = FALSE argument of scale_fill_discrete(). In tables, use tabyl(), which shows counts for all factor levels, or if using count() from dplyr add the argument .drop = FALSE to include counts for all factor levels. Note that these options act on factor levels, so the column must be of class factor, with all expected categories defined as levels, for unused categories to be retained.
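A minimal sketch of the count() approach is shown below. The toy data frame and the status_levels vector are illustrative; the status values are those seen in the tabyl() output earlier on this page.

```r
library(dplyr)

# All statuses we want to display, in a fixed order
status_levels <- c("missed", "not_attempted", "not_performed", "seen_not_ok", "seen_ok")

# Hypothetical toy data for one jurisdiction in which only two statuses occur
followups_demo <- tibble::tibble(
  followup_status = c("seen_ok", "seen_ok", "missed")
) %>% 
  mutate(followup_status = factor(followup_status, levels = status_levels))

# .drop = FALSE keeps zero-count rows for unused factor levels
status_counts <- followups_demo %>% 
  count(followup_status, .drop = FALSE)

status_counts
```

Defining the levels once and re-using them for every jurisdiction is one way to keep tables and legends consistent across reports.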

Daily individual tracking

If your outbreak is small enough, you may want to look at each contact individually and see their status over the course of their follow-up. Fortunately, this followups dataset already contains a column with the day “number” of follow-up (1-14). If this does not exist in your data, you could create it by calculating the difference between the encounter date and the date follow-up was intended to begin for the contact.
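As a sketch of that calculation, the toy example below derives the day number from two date columns. It assumes that the follow-up start date is available in the same data frame (in the Go.Data example it lives in the contacts dataset, so a join would be needed first) and that the start date itself counts as day 1.

```r
library(dplyr)

# Hypothetical toy data: one contact whose follow-up started on 1 September 2020
followups_demo <- tibble::tibble(
  contact_id             = "C1",
  date_of_followup_start = as.Date("2020-09-01"),
  date_of_followup       = as.Date(c("2020-09-01", "2020-09-02", "2020-09-05"))
)

followups_demo <- followups_demo %>% 
  mutate(
    # assumption: the first day of follow-up counts as day 1, hence the + 1
    followup_number = as.integer(date_of_followup - date_of_followup_start) + 1L)

followups_demo
```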

A convenient visualisation mechanism (if the number of contacts is not too large) can be a heat plot, made with geom_tile(). See more details in the Heat plots page.

ggplot(data = followups_clean)+
  geom_tile(mapping = aes(x = followup_number, y = contact_id, fill = followup_status),
            color = "grey")+       # grey gridlines
  scale_fill_manual( values = c("yellow", "grey", "orange", "darkred", "darkgreen"))+
  theme_minimal()+
  scale_x_continuous(breaks = seq(from = 1, to = 14, by = 1))

Analyse by group

Perhaps these follow-up data are being viewed on a daily or weekly basis for operational decision-making. You may want more meaningful disaggregations by geographic area or by contact-tracing team. We can do this by adjusting the columns provided to count() (or to group_by(), if you summarise the data that way).

plot_by_region <- followups_clean %>%                      # begin with follow-up dataset
  count(admin_1_name, admin_2_name, followup_status) %>%   # get counts by unique region-status (creates column 'n' with counts)
  
  # begin ggplot()
  ggplot(
    mapping = aes(x = reorder(admin_2_name, n),     # reorder admin factor levels by the numeric values in column 'n'
                  y = n,                            # heights of bar from column 'n'
                  fill = followup_status,           # color stacked bars by their status
                  label = n))+                      # to pass to geom_text()
  geom_col()+                                     # stacked bars, mapping inherited from above 
  geom_text(                                      # add text, mapping inherited from above
    size = 3,                                         
    position = position_stack(vjust = 0.5), 
    color = "white",           
    check_overlap = TRUE,
    fontface = "bold")+
  coord_flip()+
  labs(
    x = "",
    y = "Number of contacts",
    title = "Contact Followup Status, by Region",
    fill = "Followup Status",
    subtitle = str_glue("Data as of {max(followups_clean$date_of_followup, na.rm=T)}")) +
  theme_classic()+                                                                      # Simplify background
  facet_wrap(~admin_1_name, strip.position = "right", scales = "free_y", ncol = 1)      # introduce facets 

plot_by_region

25.4 KPI Tables

There are a number of different Key Performance Indicators (KPIs) that can be calculated and tracked at varying levels of disaggregation and across different time periods to monitor contact tracing performance. Once you have the calculations down and the basic table format, it is fairly easy to swap in and out different KPIs.

There are numerous sources of contact tracing KPIs, such as this one from ResolveToSaveLives.org. The majority of the work will be walking through your data structure and thinking through all of the inclusion/exclusion criteria. We show a few examples below, using the Go.Data metadata structure:

Category | Indicator | Go.Data Numerator | Go.Data Denominator
Process Indicator - Speed of Contact Tracing | % cases interviewed and isolated within 24h of case report | COUNT OF case_id WHERE (date_of_reporting - date_of_data_entry) < 1 day AND (isolation_startdate - date_of_data_entry) < 1 day | COUNT OF case_id
Process Indicator - Speed of Contact Tracing | % contacts notified and quarantined within 24h of elicitation | COUNT OF contact_id WHERE followup_status == “SEEN_NOT_OK” OR “SEEN_OK” AND date_of_followup - date_of_reporting < 1 day | COUNT OF contact_id
Process Indicator - Completeness of Testing | % new symptomatic cases tested and interviewed within 3 days of onset of symptoms | COUNT OF case_id WHERE (date_of_reporting - date_of_onset) <= 3 days | COUNT OF case_id
Outcome Indicator - Overall | % new cases among existing contact list | COUNT OF case_id WHERE was_contact == “TRUE” | COUNT OF case_id
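To illustrate how one of these indicators might translate into code, here is a minimal sketch of the last row (% new cases among existing contact list). The data frame cases_demo is invented, and was_contact is assumed to be stored as a logical column; if it is stored as text such as “TRUE”, it would need converting first.

```r
library(dplyr)

# Hypothetical toy case data; `was_contact` is assumed to be stored as logical
cases_demo <- tibble::tibble(
  case_id     = c("CASE-1", "CASE-2", "CASE-3", "CASE-4"),
  was_contact = c(TRUE, FALSE, TRUE, NA)
)

kpi_cases_from_contacts <- cases_demo %>% 
  summarise(
    numerator   = sum(was_contact, na.rm = TRUE),   # cases previously listed as contacts
    denominator = n(),                              # all cases
    percent     = round(100 * numerator / denominator, 1))

kpi_cases_from_contacts
```

Whether cases with a missing was_contact value belong in the denominator is one of the inclusion/exclusion decisions you would need to make and document.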

Below we will walk through a sample exercise of creating a nice table visual to show contact follow-up across admin areas. At the end, we will make it fit for presentation with the formattable package (but you could use other packages like flextable - see Tables for presentation).

How you create a table like this will depend on the structure of your contact tracing data. Use the Descriptive tables page to learn how to summarise data using dplyr functions.

We will create a table that will be dynamic and change as the data change. To make the results interesting, we will set a report_date to allow us to simulate running the table on a certain day (we pick 10th June 2020). The data are filtered to that date.

# Set "Report date" to simulate running the report with data "as of" this date
report_date <- as.Date("2020-06-10")

# Create follow-up data to reflect the report date.
table_data <- followups_clean %>% 
  filter(date_of_followup <= report_date)

Now, based on our data structure, we will do the following:

  1. Begin with the followups data and summarise it to contain, for each unique contact:
     • The date of latest record (no matter the status of the encounter)
     • The date of latest encounter where the contact was “seen”
     • The encounter status at the latest record (e.g. seen with symptoms, seen without symptoms, not performed)
  2. Join these data to the contacts data, which contains other information such as the overall contact status, date of last exposure to a case, etc. We will also calculate metrics of interest for each contact, such as days since last exposure
  3. Group the enhanced contact data by geographic region (admin_2_name) and calculate summary statistics per region
  4. Finally, format the table nicely for presentation

First we summarise the follow-up data to get the information of interest:

followup_info <- table_data %>% 
  group_by(contact_id) %>% 
  summarise(
    date_last_record   = max(date_of_followup, na.rm=T),
    date_last_seen     = max(date_of_followup[followup_status %in% c("seen_ok", "seen_not_ok")], na.rm=T),
    status_last_record = followup_status[which(date_of_followup == date_last_record)]) %>% 
  ungroup()

Here is how these data look:
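One point to watch in a summary like this: for a contact with no “seen” encounters, max() is applied to an empty set of dates, which produces a warning and, as far as we are aware, does not return a regular missing value. A hedged alternative is sketched below using a small helper (latest_date() is a hypothetical function written for this example, not part of any package) and invented toy data.

```r
library(dplyr)

# Hypothetical toy data: contact C2 has never been "seen"
followups_demo <- tibble::tibble(
  contact_id       = c("C1", "C1", "C2"),
  date_of_followup = as.Date(c("2020-06-01", "2020-06-02", "2020-06-01")),
  followup_status  = c("seen_ok", "not_performed", "missed")
)

# Hypothetical helper: latest date, or NA if the vector is empty or all missing
latest_date <- function(dates) {
  dates <- dates[!is.na(dates)]
  if (length(dates) == 0) as.Date(NA) else max(dates)
}

followup_info_demo <- followups_demo %>% 
  group_by(contact_id) %>% 
  summarise(
    date_last_record = latest_date(date_of_followup),
    date_last_seen   = latest_date(
      date_of_followup[followup_status %in% c("seen_ok", "seen_not_ok")]))

followup_info_demo
```

With this approach, a never-seen contact has date_last_seen set to NA, which later steps can test with is.na().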

Now we will add this information to the contacts dataset, and calculate some additional columns.

contacts_info <- followup_info %>% 
  right_join(contacts, by = "contact_id") %>% 
  mutate(
    database_date       = max(date_last_record, na.rm=T),
    days_since_seen     = database_date - date_last_seen,
    days_since_exposure = database_date - date_of_last_exposure
    )

Here is how these data look. Note the contacts columns to the right, and the new calculated columns at the far right.

Next we summarise the contacts data by region, to achieve a concise data frame of summary statistic columns.

contacts_table <- contacts_info %>% 
  
  group_by(`Admin 2` = admin_2_name) %>%
  
  summarise(
    `Registered contacts` = n(),
    `Active contacts`     = sum(contact_status == "UNDER_FOLLOW_UP", na.rm=T),
    `In first week`       = sum(days_since_exposure < 8, na.rm=T),
    `In second week`      = sum(days_since_exposure >= 8 & days_since_exposure < 15, na.rm=T),
    `Became case`         = sum(contact_status == "BECAME_CASE", na.rm=T),
    `Lost to follow up`   = sum(days_since_seen >= 3, na.rm=T),
    `Never seen`          = sum(is.na(date_last_seen)),
    # followup_status was converted to lowercase at import, so compare against lowercase values
    `Followed up - signs` = sum(status_last_record == "seen_not_ok" & date_last_record == database_date, na.rm=T),
    `Followed up - no signs` = sum(status_last_record == "seen_ok" & date_last_record == database_date, na.rm=T),
    `Not Followed up`     = sum(
      (status_last_record == "not_attempted" | status_last_record == "not_performed") &
        date_last_record == database_date, na.rm=T)) %>% 
    
  arrange(desc(`Registered contacts`))

And now we apply styling from the formattable and knitr packages, including a footnote that shows the “as of” date.

contacts_table %>%
  mutate(
    `Admin 2` = formatter("span", style = ~ formattable::style(
      color = ifelse(is.na(`Admin 2`), "red", "grey"),   # use is.na(); comparing with == NA always returns NA
      font.weight = "bold",font.style = "italic"))(`Admin 2`),
    `Followed up - signs`= color_tile("white", "orange")(`Followed up - signs`),
    `Followed up - no signs`= color_tile("white", "#A0E2BD")(`Followed up - no signs`),
    `Became case`= color_tile("white", "grey")(`Became case`),
    `Lost to follow up`= color_tile("white", "grey")(`Lost to follow up`), 
    `Never seen`= color_tile("white", "red")(`Never seen`),
    `Active contacts` = color_tile("white", "#81A4CE")(`Active contacts`)
  ) %>%
  kable("html", escape = F, align =c("l","c","c","c","c","c","c","c","c","c","c")) %>%
  kable_styling("hover", full_width = FALSE) %>%
  add_header_above(c(" " = 3, 
                     "Of contacts currently under follow up" = 5,
                     "Status of last visit" = 3)) %>% 
  kableExtra::footnote(general = str_glue("Data are current to {format(report_date, '%b %d %Y')}"))

 |  |  | Of contacts currently under follow up |  |  |  |  | Status of last visit |  | 
Admin 2 | Registered contacts | Active contacts | In first week | In second week | Became case | Lost to follow up | Never seen | Followed up - signs | Followed up - no signs | Not Followed up
Djembe | 59 | 30 | 44 | 0 | 2 | 15 | 22 | 0 | 0 | 0
Trumpet | 3 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Venu | 2 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0
Congas | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0
Cornet | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0

Note: Data are current to Jun 10 2020

25.5 Transmission Matrices

As discussed in the Heat plots page, you can create a matrix of “who infected whom” using geom_tile().

When new contacts are created, Go.Data stores this relationship information in the relationships API endpoint, and we can see the first 50 rows of this dataset below. This means that we can create a heat plot with relatively few steps, given that each contact is already joined to its source case.

As done above for the age pyramid comparing cases and contacts, we can select the few variables we need and create columns with categorical age groupings for both sources (cases) and targets (contacts).

heatmap_ages <- relationships %>% 
  select(source_age, target_age) %>% 
  mutate(                              # add categorical age-group columns for sources and targets
    source_age_class = epikit::age_categories(source_age, breakers = seq(0, 80, 5)),
    target_age_class = epikit::age_categories(target_age, breakers = seq(0, 80, 5))) 

As described previously, we create a cross-tabulation;

cross_tab <- table(
  source_cases = heatmap_ages$source_age_class,
  target_cases = heatmap_ages$target_age_class)

cross_tab
##             target_cases
## source_cases 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80+
##        0-4     0   0     0     0     0     0     0     0     0     1     0     1     0     0     0     0   0
##        5-9     0   0     1     0     0     0     0     1     0     0     0     1     0     0     0     0   0
##        10-14   0   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0
##        15-19   0   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0
##        20-24   1   1     0     1     2     0     2     1     0     0     0     1     0     0     0     0   1
##        25-29   1   2     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0
##        30-34   0   0     0     0     0     0     0     0     1     1     0     1     0     0     0     0   0
##        35-39   0   2     0     0     0     0     0     0     0     1     0     0     0     0     0     0   0
##        40-44   0   0     0     0     1     0     2     1     0     3     1     1     0     0     0     1   1
##        45-49   1   2     2     0     0     0     3     0     1     0     3     2     1     0     0     0   1
##        50-54   1   2     1     2     0     0     1     0     0     3     4     1     0     1     0     0   1
##        55-59   0   1     0     0     1     1     2     0     0     0     0     0     0     0     0     0   0
##        60-64   0   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0
##        65-69   0   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0
##        70-74   0   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0
##        75-79   0   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0   0
##        80+     1   0     0     2     1     0     0     0     1     0     0     0     0     0     0     0   0

convert into long format with proportions;

long_prop <- data.frame(prop.table(cross_tab))

and create a heat-map for age.

ggplot(data = long_prop)+       # use long data, with proportions as Freq
  geom_tile(                    # visualize it in tiles
    aes(
      x = target_cases,         # x-axis is case age
      y = source_cases,     # y-axis is infector age
      fill = Freq))+            # color of the tile is the Freq column in the data
  scale_fill_gradient(          # adjust the fill color of the tiles
    low = "blue",
    high = "orange")+
  theme(axis.text.x = element_text(angle = 90))+
  labs(                         # labels
    x = "Target case age",
    y = "Source case age",
    title = "Who infected whom",
    subtitle = "Frequency matrix of transmission events",
    fill = "Proportion of all\ntransmission events"     # legend title
  )
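The prop.table() step used above divides every cell by the grand total, so the plotted values are proportions of all events. If you would rather show proportions within each source age group, the margin argument can be used. The sketch below uses a tiny invented cross-tabulation (cross_tab_demo is hypothetical) to show both options.

```r
# Hypothetical toy cross-tabulation of source and target age groups
cross_tab_demo <- table(
  source_cases = c("0-4", "0-4", "5-9", "5-9"),
  target_cases = c("0-4", "5-9", "5-9", "5-9"))

# Proportion of all events (cells sum to 1), in long format
long_prop_demo <- data.frame(prop.table(cross_tab_demo))

# Alternative: proportions within each source age group (each row of the table sums to 1)
long_prop_by_source <- data.frame(prop.table(cross_tab_demo, margin = 1))

long_prop_demo
long_prop_by_source
```

Both results have the columns source_cases, target_cases and Freq, so either could be passed to the same ggplot() code; only the interpretation of the fill colour (and therefore the legend title) changes.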

26 Survey analysis

26.1 Overview

This page demonstrates the use of several packages for survey analysis.

Most survey R packages rely on the survey package for doing weighted analysis. We will use survey as well as srvyr (a wrapper for survey allowing for tidyverse-style coding) and gtsummary (a wrapper for survey allowing for publication-ready tables). While the original survey package does not allow for tidyverse-style coding, it does have the added benefit of allowing for survey-weighted generalised linear models (which will be added to this page at a later date). We will also demonstrate using a function from the sitrep package to create sampling weights (N.B. this package is not yet on CRAN, but can be installed from GitHub).

Most of this page is based off work done for the “R4Epis” project; for detailed code and R-markdown templates see the “R4Epis” github page. Some of the survey package based code is based off early versions of EPIET case studies.

Currently, this page does not address sample size calculations or sampling. For a simple-to-use sample size calculator see OpenEpi. The GIS basics page of the handbook will eventually have a section on spatial random sampling, and this page will eventually have a section on sampling frames as well as sample size calculations.

  1. Survey data
  2. Observation time
  3. Weighting
  4. Survey design objects
  5. Descriptive analysis
  6. Weighted proportions
  7. Weighted rates

26.2 Preparation

Packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load packages with library() from base R. See the page on R basics for more information on R packages.
Here we also demonstrate using the p_load_gh() function from pacman to install and load a package from GitHub which has not yet been published on CRAN.

## load packages from CRAN
pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics
               tsibble,      # handle time series datasets
               survey,       # for survey functions
               srvyr,        # dplyr wrapper for survey package
               gtsummary,    # wrapper for survey package to produce tables
               apyramid,     # a package dedicated to creating age pyramids
               patchwork,    # for combining ggplots
               ggforce       # for alluvial/sankey plots
               ) 

## load packages from github
pacman::p_load_gh(
     "R4EPI/sitrep"          # for observation time / weighting functions
)

Load data

The example dataset used in this section:

  • fictional mortality survey data.
  • fictional population counts for the survey area.
  • data dictionary for the fictional mortality survey data.

This is based off the MSF OCA ethical review board pre-approved survey. The fictional dataset was produced as part of the “R4Epis” project. This is all based off data collected using KoboToolbox, which is a data collection software based off Open Data Kit.

Kobo allows you to export both the collected data, as well as the data dictionary for that dataset. We strongly recommend doing this as it simplifies data cleaning and is useful for looking up variables/questions.

TIP: The Kobo data dictionary has variable names in the “name” column of the survey sheet. Possible values for each variable are specified in the choices sheet. In the choices sheet, “name” has the shortened value and the “label::english” and “label::french” columns have the appropriate long versions. Using the msf_dict_survey() function from the epidict package to import a Kobo dictionary Excel file will re-format this for you so it can be used easily to recode.

CAUTION: The example dataset is not the same as an export (as in Kobo you export different questionnaire levels individually) - see the survey data section below to merge the different levels.

The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

# import the survey data
survey_data <- rio::import("survey_data.xlsx")

# import the dictionary into R
survey_dict <- rio::import("survey_dict.xlsx") 

The first 10 rows of the survey are displayed below.

We also want to import the data on the sampling population so that we can produce appropriate weights. This data can be in different formats, but we suggest structuring it as seen below (it can simply be typed into an Excel sheet).

# import the population data
population <- rio::import("population.xlsx")

The first 10 rows of the population data are displayed below.

For cluster surveys you may want to add survey weights at the cluster level. You could read this data in as above. Alternatively, if there are only a few counts, these could be entered into a tibble as below. In any case you will need one column with a cluster identifier that matches your survey data, and another column with the number of households in each cluster.

## define the number of households in each cluster
cluster_counts <- tibble(cluster = c("village_1", "village_2", "village_3", "village_4", 
                                     "village_5", "village_6", "village_7", "village_8",
                                     "village_9", "village_10"), 
                         households = c(700, 400, 600, 500, 300, 
                                        800, 700, 400, 500, 500))
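
Because the cluster identifier has to match the survey data exactly, it can be worth checking for mismatches before weighting. The following is a small sketch in base R; the `survey_clusters` values are invented and include a deliberate capitalisation error.

```r
# Hypothetical cluster identifiers as they might appear in the survey data
survey_clusters <- c("village_1", "village_2", "village_2", "Village_3", "village_10")

# Cluster identifiers from the household-count table
count_clusters <- paste0("village_", 1:10)

# Identifiers present in the survey but absent from the counts table
unmatched <- setdiff(unique(survey_clusters), count_clusters)
unmatched
```

Any value returned here would need to be cleaned (or added to the counts table) before cluster weights can be attached.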

Clean data

The code below ensures that the date columns are in the appropriate format. There are several other ways of doing this (see the Working with dates page for details), but using the dictionary to define dates is quick and easy.

We also create an age group variable using the age_categories() function from epikit; see the Cleaning data section of the handbook for details. In addition, we create a character variable defining which district the various clusters are in.

Finally, we recode all of the yes/no variables to TRUE/FALSE variables, because otherwise they cannot be used by the survey proportion functions.

## select the date variable names from the dictionary 
DATEVARS <- survey_dict %>% 
  filter(type == "date") %>% 
  filter(name %in% names(survey_data)) %>% 
  ## filter to match the column names of your data
  pull(name) # select date vars
  
## change to dates 
survey_data <- survey_data %>%
  mutate(across(all_of(DATEVARS), as.Date))


## add those with only age in months to the year variable (divide by twelve)
survey_data <- survey_data %>% 
  mutate(age_years = if_else(is.na(age_years), 
                             age_months / 12, 
                             age_years))

## define age group variable
survey_data <- survey_data %>% 
     mutate(age_group = age_categories(age_years, 
                                    breakers = c(0, 3, 15, 30, 45)
                                    ))


## create a character variable based off groups of a different variable 
survey_data <- survey_data %>% 
  mutate(health_district = case_when(
    cluster_number %in% c(1:5) ~ "district_a", 
    TRUE ~ "district_b"
  ))


## select the yes/no variable names from the dictionary 
YNVARS <- survey_dict %>% 
  filter(type == "yn") %>% 
  filter(name %in% names(survey_data)) %>% 
  ## filter to match the column names of your data
  pull(name) # select yn vars
  
## change to logical (TRUE/FALSE) 
survey_data <- survey_data %>%
  mutate(across(all_of(YNVARS), 
                ~ str_detect(.x, pattern = "yes")))
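
Note that str_detect() with pattern = "yes" is a case-sensitive partial match. If your answers are less tidy than the example data, a stricter comparison may be safer. The sketch below uses base R on invented answers; it is an alternative under that assumption, not the handbook's method.

```r
# Hypothetical raw answers, including mixed case, a missing value and free text
answers <- c("yes", "no", "Yes", NA, "yes, but only once")

# Exact, case-insensitive match after trimming whitespace; NA stays NA
is_yes <- tolower(trimws(answers)) == "yes"
is_yes
```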

26.3 Survey data

There are numerous sampling designs that can be used for surveys. Here we will demonstrate code for:

  • Stratified
  • Cluster
  • Stratified and cluster

As described above (depending on how you design your questionnaire) the data for each level would be exported as a separate dataset from Kobo. In our example there is one level for households and one level for individuals within those households.

These two levels are linked by a unique identifier. For a Kobo dataset this variable is "_index" at the household level, which matches the "_parent_index" at the individual level. Joining on these repeats each household row once for every matching individual; see the handbook section on joining for details.

## join the individual and household data to form a complete data set
survey_data <- left_join(survey_data_hh, 
                         survey_data_indiv,
                         by = c("_index" = "_parent_index"))


## create a unique identifier by combining indices of the two levels 
survey_data <- survey_data %>% 
     mutate(uid = str_glue("{index}_{index_y}"))
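
To see what the join does to the number of rows, here is a toy version in base R using merge(), which behaves like left_join() when all.x = TRUE. The two small tables and their column names are invented for illustration.

```r
# Hypothetical household-level export: one row per household
households <- data.frame(
  index        = c(1, 2),
  village_name = c("village_1", "village_2")
)

# Hypothetical individual-level export: one row per person,
# with parent_index pointing at the household
individuals <- data.frame(
  parent_index = c(1, 1, 2),
  index_y      = c(1, 2, 3),
  age_years    = c(34, 5, 61)
)

# Keep every household and repeat it for each matching individual
joined <- merge(households, individuals,
                by.x = "index", by.y = "parent_index",
                all.x = TRUE)

# Combine the two indices into one identifier per person
joined$uid <- paste(joined$index, joined$index_y, sep = "_")
joined
```

Household 1 appears twice in the result because two individuals belong to it.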

26.4 Observation time

For mortality surveys we want to know how long each individual was present in the location, so that we can calculate an appropriate mortality rate for our period of interest. This is not relevant to all surveys, but it is particularly important for mortality surveys as they are frequently conducted among mobile or displaced populations.

To do this we first define our time period of interest, also known as a recall period (i.e. the time that participants are asked to report on when answering questions). We can then use this period to set inappropriate dates to missing, i.e. if deaths are reported from outside the period of interest.

## set the start/end of recall period
## can be changed to date variables from dataset 
## (e.g. arrival date & date questionnaire)
survey_data <- survey_data %>% 
  mutate(recall_start = as.Date("2018-01-01"), 
         recall_end   = as.Date("2018-05-01")
  )


# set inappropriate dates to NA based on rules 
## e.g. arrivals before start, departures after end
survey_data <- survey_data %>%
      mutate(
           arrived_date = if_else(arrived_date < recall_start, 
                                 as.Date(NA),
                                  arrived_date),
           birthday_date = if_else(birthday_date < recall_start,
                                  as.Date(NA),
                                  birthday_date),
           left_date = if_else(left_date > recall_end,
                              as.Date(NA),
                               left_date),
           death_date = if_else(death_date > recall_end,
                               as.Date(NA),
                               death_date)
           )

We can then use our date variables to define start and end dates for each individual. We can use the find_start_date() and find_end_date() functions from sitrep to find the dates and their causes, and then use those to calculate the difference in days between them (person-time).

start date: Earliest appropriate arrival event within your recall period. Either the beginning of your recall period (which you define in advance), or a date after the start of recall if applicable (e.g. arrivals or births).

end date: Earliest appropriate departure event within your recall period. Either the end of your recall period, or a date before the end of recall if applicable (e.g. departures, deaths).

## create new variables for start and end dates/causes
survey_data <- survey_data %>% 
     ## choose earliest date entered in survey
     ## from births, household arrivals, and camp arrivals 
     find_start_date("birthday_date",
                  "arrived_date",
                  period_start = "recall_start",
                  period_end   = "recall_end",
                  datecol      = "startdate",
                  datereason   = "startcause" 
                 ) %>%
     ## choose earliest date entered in survey
     ## from camp departures, death and end of the study
     find_end_date("left_date",
                "death_date",
                period_start = "recall_start",
                period_end   = "recall_end",
                datecol      = "enddate",
                datereason   = "endcause" 
               )


## label those that were present at the start/end (except births/deaths)
survey_data <- survey_data %>% 
     mutate(
       ## fill in start date to be the beginning of recall period (for those empty) 
       startdate = if_else(is.na(startdate), recall_start, startdate), 
       ## set the start cause to present at start if equal to recall period 
       ## unless it is equal to the birth date 
       startcause = if_else(startdate == recall_start & startcause != "birthday_date",
                              "Present at start", startcause), 
       ## fill in end date to be end of recall period (for those empty) 
       enddate = if_else(is.na(enddate), recall_end, enddate), 
       ## set the end cause to present at end if equal to recall end 
       ## unless it is equal to the death date
       endcause = if_else(enddate == recall_end & endcause != "death_date", 
                            "Present at end", endcause))


## Define observation time in days
survey_data <- survey_data %>% 
  mutate(obstime = as.numeric(enddate - startdate))
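
The logic of the observation-time calculation can be followed by hand on a few invented individuals. This sketch uses only base R and does not call the sitrep functions; the three people and their dates are hypothetical.

```r
# Recall period as defined in the example above
recall_start <- as.Date("2018-01-01")
recall_end   <- as.Date("2018-05-01")

# Hypothetical individuals: one present throughout, one who arrived
# during recall, and one who died during recall
people <- data.frame(
  id           = 1:3,
  arrived_date = as.Date(c(NA, "2018-02-15", NA)),
  death_date   = as.Date(c(NA, NA, "2018-03-01"))
)

# Start date: arrival if there is one, otherwise the start of recall
people$startdate <- people$arrived_date
people$startdate[is.na(people$startdate)] <- recall_start

# End date: death if there is one, otherwise the end of recall
people$enddate <- people$death_date
people$enddate[is.na(people$enddate)] <- recall_end

# Observation time in days
people$obstime <- as.numeric(people$enddate - people$startdate)
people
```

The person present throughout contributes the full recall period, while the other two contribute only the days between their start and end events.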

26.5 Weighting

It is important that you drop erroneous observations before adding survey weights. For example, if you have observations with negative observation time, you will need to check those (you can do this with the assert_positive_timespan() function from sitrep). You may also want to drop empty rows (e.g. with drop_na(uid)) or remove duplicates (see the handbook section on De-duplication for details). Those without consent need to be dropped too.

In this example we filter for the cases we want to drop and store them in a separate data frame - this way we can describe those that were excluded from the survey. We then use the anti_join() function from dplyr to remove these dropped cases from our survey data.

DANGER: You cannot have missing values in your weight variable, or in any of the variables relevant to your survey design (e.g. age, sex, strata or cluster variables).

## store the cases that you drop so you can describe them (e.g. non-consenting 
## or wrong village/cluster)
dropped <- survey_data %>% 
  filter(!consent | is.na(startdate) | is.na(enddate) | village_name == "other")

## use the dropped cases to remove the unused rows from the survey data set  
survey_data <- anti_join(survey_data, dropped, by = names(dropped))
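
Before weighting, it can help to count missing values in the design variables explicitly. Below is a small base R sketch on an invented data frame; substitute the variables relevant to your own design.

```r
# Hypothetical survey rows with some missing design variables
toy <- data.frame(
  age_group       = c("0-2", "3-14", NA),
  sex             = c("Male", NA, "Female"),
  health_district = c("district_a", "district_b", "district_b")
)

# Variables that must be complete for this (assumed) design
design_vars <- c("age_group", "sex", "health_district")

# Number of missing values per design variable
missing_counts <- colSums(is.na(toy[design_vars]))
missing_counts

# Rows that are complete on all design variables
complete_rows <- complete.cases(toy[design_vars])
sum(complete_rows)
```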

As mentioned above we demonstrate how to add weights for three different study designs (stratified, cluster and stratified cluster). These require information on the source population and/or the clusters surveyed. We will use the stratified cluster code for this example, but use whichever is most appropriate for your study design.

# stratified ------------------------------------------------------------------
# create a variable called "surv_weight_strata"
# contains weights for each individual - by age group, sex and health district
survey_data <- add_weights_strata(x = survey_data,
                                         p = population,
                                         surv_weight = "surv_weight_strata",
                                         surv_weight_ID = "surv_weight_ID_strata",
                                         age_group, sex, health_district)

## cluster ---------------------------------------------------------------------

# get the number of individuals interviewed per household
# adds a variable with counts of the household (parent) index variable
survey_data <- survey_data %>%
  add_count(index, name = "interviewed")


## create cluster weights
survey_data <- add_weights_cluster(x = survey_data,
                                          cl = cluster_counts,
                                          eligible = member_number,
                                          interviewed = interviewed,
                                          cluster_x = village_name,
                                          cluster_cl = cluster,
                                          household_x = index,
                                          household_cl = households,
                                          surv_weight = "surv_weight_cluster",
                                          surv_weight_ID = "surv_weight_ID_cluster",
                                          ignore_cluster = FALSE,
                                          ignore_household = FALSE)


# stratified and cluster ------------------------------------------------------
# create a survey weight for cluster and strata
survey_data <- survey_data %>%
  mutate(surv_weight_cluster_strata = surv_weight_strata * surv_weight_cluster)
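
As a rough intuition for what a stratified weight represents, the sketch below computes weights as the population count divided by the number sampled in each stratum. This is the usual textbook definition and an assumption here; the details of add_weights_strata() may differ. All numbers are invented.

```r
# Hypothetical source population per stratum
pop_counts <- data.frame(
  health_district = c("district_a", "district_b"),
  population      = c(12000, 8000)
)

# Hypothetical number of people sampled per stratum
sampled <- data.frame(
  health_district = c("district_a", "district_b"),
  n_sampled       = c(300, 100)
)

# Each sampled person "represents" population / n_sampled people
weights <- merge(pop_counts, sampled, by = "health_district")
weights$surv_weight <- weights$population / weights$n_sampled
weights

# Sanity check: the weights of all sampled people sum to the population
sum(weights$surv_weight * weights$n_sampled)
```

District B was sampled less intensively, so each respondent there carries a larger weight.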

26.6 Survey design objects

Create a survey object according to your study design. It is used in the same way as a data frame to calculate weighted proportions etc. Make sure that all necessary variables are created before this.

There are four options; comment out those you do not use:

  • Simple random
  • Stratified
  • Cluster
  • Stratified cluster

For this template, we will pretend that we conducted cluster surveys in two separate strata (health districts A and B). So to get overall estimates we need to have combined cluster and strata weights.

As mentioned previously, there are two packages available for doing this. The classic one is survey and then there is a wrapper package called srvyr that makes tidyverse-friendly objects and functions. We will demonstrate both, but note that most of the code in this chapter will use srvyr based objects. The one exception is that the gtsummary package only accepts survey objects.

26.6.1 Survey package

The survey package effectively uses base R coding, and so it is not possible to use pipes (%>%) or other dplyr syntax. With the survey package we use the svydesign() function to define a survey object with appropriate clusters, weights and strata.

NOTE: We need to use the tilde (~) in front of variables because the package uses the base R syntax of specifying variables with formulae.

# simple random ---------------------------------------------------------------
base_survey_design_simple <- svydesign(ids = ~1, # 1 for no cluster ids
                   weights = NULL,               # No weight added
                   strata = NULL,                # sampling was simple (no strata)
                   data = survey_data            # have to specify the dataset
                  )

## stratified ------------------------------------------------------------------
base_survey_design_strata <- svydesign(ids = ~1,  # 1 for no cluster ids
                   weights = ~surv_weight_strata, # weight variable created above
                   strata = ~health_district,     # sampling was stratified by district
                   data = survey_data             # have to specify the dataset
                  )

# cluster ---------------------------------------------------------------------
base_survey_design_cluster <- svydesign(ids = ~village_name, # cluster ids
                   weights = ~surv_weight_cluster, # weight variable created above
                   strata = NULL,                 # sampling was simple (no strata)
                   data = survey_data              # have to specify the dataset
                  )

# stratified cluster ----------------------------------------------------------
base_survey_design <- svydesign(ids = ~village_name,      # cluster ids
                   weights = ~surv_weight_cluster_strata, # weight variable created above
                   strata = ~health_district,             # sampling was stratified by district
                   data = survey_data                     # have to specify the dataset
                  )
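
If you want to try svydesign() without the handbook data, the survey package ships with the `api` example datasets. The sketch below builds a cluster design and a stratified design from them. It follows the pattern used in the survey package documentation, and the variable names (dnum, stype, pw) belong to those datasets rather than to our mortality survey.

```r
library(survey)

# Load the example datasets shipped with the survey package
data(api)

# Cluster design: school districts (dnum) are the clusters
clus_design <- svydesign(ids     = ~dnum,
                         weights = ~pw,
                         data    = apiclus1)

# Stratified design: sampling stratified by school type (stype)
strat_design <- svydesign(ids     = ~1,
                          strata  = ~stype,
                          weights = ~pw,
                          data    = apistrat)

# A weighted mean from the cluster design
clus_mean <- svymean(~api00, clus_design)
clus_mean
```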

26.6.2 Srvyr package

With the srvyr package we can use the as_survey_design() function, which has all the same arguments as above but allows pipes (%>%), and so we do not need to use the tilde (~).

## simple random ---------------------------------------------------------------
survey_design_simple <- survey_data %>% 
  as_survey_design(ids = 1, # 1 for no cluster ids 
                   weights = NULL, # No weight added
                   strata = NULL # sampling was simple (no strata)
                  )
## stratified ------------------------------------------------------------------
survey_design_strata <- survey_data %>%
  as_survey_design(ids = 1, # 1 for no cluster ids
                   weights = surv_weight_strata, # weight variable created above
                   strata = health_district # sampling was stratified by district
                  )
## cluster ---------------------------------------------------------------------
survey_design_cluster <- survey_data %>%
  as_survey_design(ids = village_name, # cluster ids
                   weights = surv_weight_cluster, # weight variable created above
                   strata = NULL # sampling was simple (no strata)
                  )

## stratified cluster ----------------------------------------------------------
survey_design <- survey_data %>%
  as_survey_design(ids = village_name, # cluster ids
                   weights = surv_weight_cluster_strata, # weight variable created above
                   strata = health_district # sampling was stratified by district
                  )

26.7 Descriptive analysis

Basic descriptive analysis and visualisation is covered extensively in other chapters of the handbook, so we will not dwell on it here. For details see the chapters on descriptive tables, statistical tests, tables for presentation, ggplot basics and R markdown reports.

In this section we will focus on how to investigate bias in your sample and visualise this. We will also look at visualising population flow in a survey setting using alluvial/sankey diagrams.

In general, you should consider including the following descriptive analyses:

  • Final number of clusters, households and individuals included
  • Number of excluded individuals and the reasons for exclusion
  • Median (range) number of households per cluster and individuals per household

26.7.1 Sampling bias

Compare the proportions in each age group between your sample and the source population. This is important to be able to highlight potential sampling bias. You could similarly repeat this looking at distributions by sex.

Note that these p-values are just indicative, and a descriptive discussion (or visualisation with age-pyramids below) of the distributions in your study sample compared to the source population is more important than the binomial test itself. This is because, with a larger sample, even small differences become statistically significant, and such differences may be irrelevant after weighting your data.

## counts and props of the study population
ag <- survey_data %>% 
  group_by(age_group) %>% 
  drop_na(age_group) %>% 
  tally() %>% 
  mutate(proportion = n / sum(n), 
         n_total = sum(n))

## counts and props of the source population
propcount <- population %>% 
  group_by(age_group) %>%
    tally(population) %>%
    mutate(proportion = n / sum(n))

## bind together the columns of two tables, group by age, and perform a 
## binomial test to see if n/total is significantly different from population
## proportion.
  ## suffix adds text to the end of column names that appear in both datasets
left_join(ag, propcount, by = "age_group", suffix = c("", "_pop")) %>%
  group_by(age_group) %>%
  ## broom::tidy(binom.test()) makes a data frame out of the binomial test and
  ## will add the variables p.value, parameter, conf.low, conf.high, method, and
  ## alternative. We will only use p.value here. You can include other
  ## columns if you want to report confidence intervals
  mutate(binom = list(broom::tidy(binom.test(n, n_total, proportion_pop)))) %>%
  unnest(cols = c(binom)) %>% # important for expanding the binom.test data frame
  ## express both proportions as percentages 
  ## (after the test, which needs a proportion between 0 and 1)
  mutate(proportion     = proportion * 100,
         proportion_pop = proportion_pop * 100) %>%
  ## Adjusting the p-values to correct for false positives 
  ## (because testing multiple age groups). This will only make 
  ## a difference if you have many age categories
  mutate(p.value = p.adjust(p.value, method = "holm")) %>%
                      
  ## Only show p-values over 0.001 (those under report as <0.001)
  mutate(p.value = ifelse(p.value < 0.001, 
                          "<0.001", 
                          as.character(round(p.value, 3)))) %>% 
  
  ## rename the columns appropriately
  select(
    "Age group" = age_group,
    "Study population (n)" = n,
    "Study population (%)" = proportion,
    "Source population (n)" = n_pop,
    "Source population (%)" = proportion_pop,
    "P-value" = p.value
  )
## # A tibble: 5 x 6
## # Groups:   Age group [5]
##   `Age group` `Study population (n)` `Study population (%)` `Source population (n)` `Source population (%)` `P-value`
##   <chr>                        <int>                  <dbl>                   <dbl>                   <dbl> <chr>    
## 1 0-2                             12                   2.56                    1360                     6.8 <0.001   
## 2 3-14                            42                   8.96                    7244                    36.2 <0.001   
## 3 15-29                           64                  13.6                     5520                    27.6 <0.001   
## 4 30-44                           52                  11.1                     3232                    16.2 0.002    
## 5 45+                            299                  63.8                     2644                    13.2 <0.001
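
Each row of the table above comes from a single binomial test, which can be reproduced in isolation. The sketch below uses the counts shown for the 0-2 age group (12 of 469 sampled, against 1360 of 20000 in the source population). The vector of p-values passed to p.adjust() is invented purely to show how the Holm correction behaves.

```r
# Counts for the 0-2 age group, taken from the table above
study_n     <- 12
study_total <- 12 + 42 + 64 + 52 + 299          # all sampled individuals
source_prop <- 1360 / (1360 + 7244 + 5520 + 3232 + 2644)

# Is 12/469 compatible with the source population proportion?
test_0_2 <- binom.test(study_n, study_total, p = source_prop)
test_0_2$p.value     # unadjusted p-value
test_0_2$conf.int    # 95% CI for the sample proportion

# Holm adjustment on a hypothetical set of three p-values
adjusted <- p.adjust(c(0.01, 0.04, 0.03), method = "holm")
adjusted
```

Holm multiplies the smallest p-value by the number of tests, the next smallest by one fewer, and so on, never letting an adjusted value fall below the one before it.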

26.7.2 Demographic pyramids

Demographic (or age-sex) pyramids are an easy way of visualising the distribution in your survey population. It is also worth considering creating descriptive tables of age and sex by survey strata. We will demonstrate using the apyramid package as it allows for weighted proportions using our survey design object created above. Other options for creating demographic pyramids are covered extensively in the handbook chapter on demographic pyramids. We will also use a wrapper function from sitrep called plot_age_pyramid() which saves a few lines of coding for producing a plot with proportions.

As with the formal binomial test of difference, seen above in the sampling bias section, we are interested here in visualising whether our sampled population is substantially different from the source population and whether weighting corrects this difference. To do this we will use the patchwork package to show our ggplot visualisations side-by-side; for details see the section on combining plots in the ggplot tips chapter of the handbook. We will visualise our source population, our un-weighted survey population and our weighted survey population. You may also consider visualising each stratum of your survey separately; in our example that would be done with the argument stack_by = "health_district" (see ?plot_age_pyramid for details).

NOTE: The x and y axes are flipped in pyramids

## define x-axis limits and labels ---------------------------------------------
## (update these numbers to be the values for your graph)
max_prop <- 35      # choose the highest proportion you want to show 
step <- 5           # choose the space you want between labels 

## this part defines vector using the above numbers with axis breaks
breaks <- c(
    seq(max_prop/100 * -1, 0 - step/100, step/100), 
    0, 
    seq(0 + step / 100, max_prop/100, step/100)
    )

## this part defines vector using the above numbers with axis limits
limits <- c(max_prop/100 * -1, max_prop/100)

## this part defines vector using the above numbers with axis labels
labels <-  c(
      seq(max_prop, step, -step), 
      0, 
      seq(step, max_prop, step)
    )


## create plots individually  --------------------------------------------------

## plot the source population 
## nb: this needs to be collapsed for the overall population (i.e. removing health districts)
source_population <- population %>%
  ## ensure that age and sex are factors
  mutate(age_group = factor(age_group, 
                            levels = c("0-2", 
                                       "3-14", 
                                       "15-29",
                                       "30-44", 
                                       "45+")), 
         sex = factor(sex)) %>% 
  group_by(age_group, sex) %>% 
  ## add the counts for each health district together 
  summarise(population = sum(population)) %>% 
  ## remove the grouping so can calculate overall proportion
  ungroup() %>% 
  mutate(proportion = population / sum(population)) %>% 
  ## plot pyramid 
  age_pyramid(
            age_group = age_group, 
            split_by = sex, 
            count = proportion, 
            proportional = TRUE) +
  ## only show the y axis label (otherwise repeated in all three plots)
  labs(title = "Source population", 
       y = "", 
       x = "Age group (years)") + 
  ## make the x axis the same for all plots 
  scale_y_continuous(breaks = breaks, 
    limits = limits, 
    labels = labels)
  
  
## plot the unweighted sample population 
sample_population <- plot_age_pyramid(survey_data, 
                 age_group = "age_group", 
                 split_by = "sex",
                 proportion = TRUE) + 
  ## only show the x axis label (otherwise repeated in all three plots)
  labs(title = "Unweighted sample population", 
       y = "Proportion (%)", 
       x = "") + 
  ## make the x axis the same for all plots 
  scale_y_continuous(breaks = breaks, 
    limits = limits, 
    labels = labels)


## plot the weighted sample population 
weighted_population <- survey_design %>% 
  ## make sure the variables are factors
  mutate(age_group = factor(age_group), 
         sex = factor(sex)) %>%
  plot_age_pyramid(
    age_group = "age_group",
    split_by = "sex", 
    proportion = TRUE) +
  ## only show the x axis label (otherwise repeated in all three plots)
  labs(title = "Weighted sample population", 
       y = "", 
       x = "")  + 
  ## make the x axis the same for all plots 
  scale_y_continuous(breaks = breaks, 
    limits = limits, 
    labels = labels)

## combine all three plots  ----------------------------------------------------
## combine three plots next to each other using + 
source_population + sample_population + weighted_population + 
  ## only show one legend and define theme 
  ## note the use of & for combining theme with plot_layout()
  plot_layout(guides = "collect") & 
  theme(legend.position = "bottom",                    # move legend to bottom
        legend.title = element_blank(),                # remove title
        text = element_text(size = 18),                # change text size
        axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1) # turn x-axis text
       )
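
The axis vectors defined at the top of this code can be inspected on their own. A ggplot2 scale needs breaks and labels of the same length, so a quick check like the sketch below (base R only, same values as above) can save some debugging when you change max_prop or step.

```r
max_prop <- 35   # highest proportion to show (%)
step     <- 5    # spacing between labels (%)

# Breaks on the proportion scale: negative on one side, positive on the other
breaks <- c(
  seq(-max_prop / 100, -step / 100, step / 100),
  0,
  seq(step / 100, max_prop / 100, step / 100)
)

# Labels shown to the reader: positive percentages on both sides
labels <- c(
  seq(max_prop, step, -step),
  0,
  seq(step, max_prop, step)
)

# Both vectors must line up one-to-one
stopifnot(length(breaks) == length(labels))
data.frame(breaks, labels)
```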

26.7.3 Alluvial/sankey diagram

Visualising starting points and outcomes for individuals can be very helpful to get an overview. There is quite an obvious application for mobile populations, however there are numerous other applications such as cohorts or any other situation where there are transitions in states for individuals. These diagrams have several different names including alluvial, sankey and parallel sets - the details are in the handbook chapter on diagrams and charts.

## summarize data
flow_table <- survey_data %>%
  count(startcause, endcause, sex) %>%  # get counts 
  gather_set_data(x = c("startcause", "endcause")) %>%     # change format for plotting
  mutate(x = fct_relevel(x, c("startcause", "endcause")),  # set startcause as first level
         x = fct_recode(x, 
                        "Start \n cause" = "startcause",   # add line break (\n) after start
                        "End \n cause"   = "endcause")
        )


## plot your dataset 
  ## on the x axis is the start and end causes
  ## gather_set_data generates an ID for each possible combination
  ## splitting by y gives the possible start/end combos
  ## value as n gives it as counts (could also be changed to proportion)
ggplot(flow_table, aes(x, id = id, split = y, value = n)) +
  ## colour lines by sex 
  geom_parallel_sets(aes(fill = sex), alpha = 0.5, axis.width = 0.2) +
  ## fill in the label boxes grey
  geom_parallel_sets_axes(axis.width = 0.15, fill = "grey80", color = "grey80") +
  ## change text colour and angle (needs to be adjusted)
  geom_parallel_sets_labels(color = "black", angle = 0, size = 5) +
  ## adjusted y and x axes (probably needs more vertical space)
  scale_x_discrete(name = NULL, expand = c(0, 0.2)) + 
  ## remove axis labels
  theme(
    title = element_text(size = 26),
    text = element_text(size = 26),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    panel.background = element_blank(),
    legend.position = "bottom",                    # move legend to bottom
    legend.title = element_blank(),                # remove title
  )

26.8 Weighted proportions

This section will detail how to produce tables for weighted counts and proportions, with associated confidence intervals and design effect. There are four different options using functions from the following packages: survey, srvyr, sitrep and gtsummary. For minimal coding to produce a standard epidemiology style table, we would recommend the sitrep function - which is a wrapper for srvyr code; note however that this is not yet on CRAN and may change in the future. Otherwise, the survey code is likely to be the most stable long-term, whereas srvyr will fit most nicely within tidyverse work-flows. While gtsummary functions hold a lot of potential, they appear to be experimental and incomplete at the time of writing.

26.8.1 Survey package

We can use the svyciprop() function from survey to get weighted proportions and accompanying 95% confidence intervals. An appropriate design effect can be extracted using the svymean() rather than the svyciprop() function. It is worth noting that svyciprop() only appears to accept variables between 0 and 1 (or TRUE/FALSE), so categorical variables will not work.

NOTE: Functions from survey also accept srvyr design objects, but here we have used the survey design object just for consistency

## produce weighted counts 
svytable(~died, base_survey_design)
## died
##      FALSE       TRUE 
## 1406244.43   76213.01
## produce weighted proportions
svyciprop(~died, base_survey_design, na.rm = TRUE)
##               2.5% 97.5%
## died 0.0514 0.0208  0.12
## get the design effect 
svymean(~died, base_survey_design, na.rm = TRUE, deff = TRUE) %>% 
  deff()
## diedFALSE  diedTRUE 
##  3.755508  3.755508
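
The same three steps can be tried on the `api` example data shipped with the survey package. In this sketch the binary indicator `high_api` (a score above 700) is invented for illustration, as is the design object name.

```r
library(survey)

# Example data from the survey package, with a hypothetical binary indicator
data(api)
apiclus1$high_api <- apiclus1$api00 > 700

# Cluster design: school districts (dnum) are the clusters
api_design <- svydesign(ids = ~dnum, weights = ~pw, data = apiclus1)

# Weighted proportion with a 95% confidence interval
prop <- svyciprop(~high_api, api_design, na.rm = TRUE)
ci   <- confint(prop)

# Design effect, taken from svymean() as described above
design_eff <- deff(svymean(~high_api, api_design, na.rm = TRUE, deff = TRUE))

prop
ci
design_eff
```

The design effect is returned once per level (FALSE and TRUE); for a binary variable the two values are identical.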

We can combine the functions from survey shown above into a function of our own, called svy_prop, defined below. We can then use that function together with map() from the purrr package to iterate over several variables and create a table. See the handbook iteration chapter for details on purrr.

# Define function to calculate weighted counts, proportions, CI and design effect
# x is the variable in quotation marks 
# design is your survey design object

svy_prop <- function(design, x) {
  
  ## put the variable of interest in a formula 
  form <- as.formula(paste0( "~" , x))
  ## only keep the TRUE column of counts from svytable
  weighted_counts <- svytable(form, design)[[2]]
  ## calculate proportions (multiply by 100 to get percentages)
  weighted_props <- svyciprop(form, design, na.rm = TRUE) * 100
  ## extract the confidence intervals and multiply to get percentages
  weighted_confint <- confint(weighted_props) * 100
  ## use svymean to calculate design effect and only keep the TRUE column
  design_eff <- deff(svymean(form, design, na.rm = TRUE, deff = TRUE))[[paste0(x, "TRUE")]]
  
  ## combine in to one data frame
  full_table <- cbind(
    "Variable"        = x,
    "Count"           = weighted_counts,
    "Proportion"      = weighted_props,
    weighted_confint, 
    "Design effect"   = design_eff
    )
  
  ## return table as a dataframe
  full_table <- data.frame(full_table, 
             ## remove the variable names from rows (is a separate column now)
             row.names = NULL)
  
  ## change numerics back to numeric
  full_table[ , 2:6] <- lapply(full_table[ , 2:6], as.numeric)
  
  ## return dataframe
  full_table
}

## iterate over several variables to create a table 
purrr::map(
  ## define variables of interest
  c("left", "died", "arrived"), 
  ## state the function to use and its arguments (design)
  svy_prop, design = base_survey_design) %>% 
  ## collapse list in to a single data frame
  bind_rows() %>% 
  ## round 
  mutate(across(where(is.numeric), ~ round(.x, digits = 1)))
##   Variable    Count Proportion X2.5. X97.5. Design.effect
## 1     left 701199.1       47.3  39.2   55.5           2.4
## 2     died  76213.0        5.1   2.1   12.1           3.8
## 3  arrived 761799.0       51.4  40.9   61.7           3.9

26.8.2 Srvyr package

With srvyr we can use dplyr syntax to create a table. Note that the survey_mean() function is used and the proportion argument is specified, and also that the same function is used to calculate design effect. This is because srvyr wraps around both of the survey package functions svyciprop() and svymean(), which are used in the above section.

NOTE: It does not seem to be possible to get proportions for categorical variables using srvyr either. If you need this, see the section below on sitrep.

## use the srvyr design object
survey_design %>% 
  summarise(
    ## produce the weighted counts 
    counts = survey_total(died), 
    ## produce weighted proportions and confidence intervals 
    ## multiply by 100 to get a percentage 
    props = survey_mean(died, 
                        proportion = TRUE, 
                        vartype = "ci") * 100, 
    ## produce the design effect 
    deff = survey_mean(died, deff = TRUE)) %>% 
  ## only keep the columns of interest
  ## (drop standard errors and repeat proportion calculation)
  select(counts, props, props_low, props_upp, deff_deff)
##     counts    props props_low props_upp deff_deff
## 1 76213.01 5.140991  2.082773  12.13328  3.755508

Here too we could write a function to then iterate over multiple variables using the purrr package. See the handbook iteration chapter for details on purrr.

# Define function to calculate weighted counts, proportions, CI and design effect
# design is your survey design object
# x is the variable in quotation marks 


srvyr_prop <- function(design, x) {
  
  summarise(
    ## using the survey design object
    design, 
    ## produce the weighted counts 
    counts = survey_total(.data[[x]]), 
    ## produce weighted proportions and confidence intervals 
    ## multiply by 100 to get a percentage 
    props = survey_mean(.data[[x]], 
                        proportion = TRUE, 
                        vartype = "ci") * 100, 
    ## produce the design effect 
    deff = survey_mean(.data[[x]], deff = TRUE)) %>% 
  ## add in the variable name
  mutate(variable = x) %>% 
  ## only keep the columns of interest
  ## (drop standard errors and repeat proportion calculation)
  select(variable, counts, props, props_low, props_upp, deff_deff)
  
}
  

## iterate over several variables to create a table 
purrr::map(
  ## define variables of interest
  c("left", "died", "arrived"), 
  ## apply the function to each variable, passing the design as an additional argument
  ~srvyr_prop(.x, design = survey_design)) %>% 
  ## collapse list in to a single data frame
  bind_rows()
##   variable    counts     props props_low props_upp deff_deff
## 1     left 701199.14 47.299782 39.235598  55.50736  2.379761
## 2     died  76213.01  5.140991  2.082773  12.13328  3.755508
## 3  arrived 761799.05 51.387583 40.927349  61.72766  3.925504

26.8.3 Sitrep package

The tab_survey() function from sitrep is a wrapper for srvyr, allowing you to create weighted tables with minimal coding. It also allows you to calculate weighted proportions for categorical variables.

## using the survey design object
survey_design %>% 
  ## pass the names of variables of interest unquoted
  tab_survey(arrived, left, died, education_level,
             deff = TRUE,   # calculate the design effect
             pretty = TRUE  # merge the proportion and 95%CI
             )
## Warning: removing 257 missing value(s) from `education_level`
## # A tibble: 9 x 5
##   variable        value            n  deff ci                
##   <chr>           <chr>        <dbl> <dbl> <chr>             
## 1 arrived         TRUE       761799.  3.93 51.4% (40.9--61.7)
## 2 arrived         FALSE      720658.  3.93 48.6% (38.3--59.1)
## 3 left            TRUE       701199.  2.38 47.3% (39.2--55.5)
## 4 left            FALSE      781258.  2.38 52.7% (44.5--60.8)
## 5 died            TRUE        76213.  3.76 5.1% (2.1--12.1)  
## 6 died            FALSE     1406244.  3.76 94.9% (87.9--97.9)
## 7 education_level higher     171644.  4.70 42.4% (26.9--59.7)
## 8 education_level primary    102609.  2.37 25.4% (16.2--37.3)
## 9 education_level secondary  130201.  6.68 32.2% (16.5--53.3)

26.8.4 Gtsummary package

With gtsummary there do not yet seem to be built-in functions to add confidence intervals or the design effect. Here we show how to define a function that calculates confidence intervals, and then add them to a gtsummary table created with the tbl_svysummary() function.

confidence_intervals <- function(data, variable, by, ...) {
  
  ## calculate the weighted proportion (its confidence interval is extracted below)
  props <- svyciprop(as.formula(paste0( "~" , variable)),
              data, na.rm = TRUE)
  
  ## extract the confidence intervals 
  as.numeric(confint(props) * 100) %>% ## make numeric and multiply for percentage
    round(., digits = 1) %>%           ## round to one digit
    c(.) %>%                           ## extract the numbers from matrix
    paste0(., collapse = "-")          ## combine to single character
}

## using the survey package design object
tbl_svysummary(base_survey_design, 
               include = c(arrived, left, died),   ## define variables want to include
               statistic = list(everything() ~ c("{n} ({p}%)"))) %>% ## define stats of interest
  add_n() %>%  ## add the weighted total 
  add_stat(fns = everything() ~ confidence_intervals) %>% ## add CIs
  ## modify the column headers
  modify_header(
    list(
      n ~ "**Weighted total (N)**",
      stat_0 ~ "**Weighted Count**",
      add_stat_1 ~ "**95%CI**"
    )
    )
Characteristic   Weighted total (N)   Weighted Count¹   95%CI
arrived          1,482,457            761,799 (51%)     40.9-61.7
left             1,482,457            701,199 (47%)     39.2-55.5
died             1,482,457            76,213 (5.1%)     2.1-12.1

¹ n (%)

26.9 Weighted ratios

Similarly, for weighted ratios (such as mortality ratios) you can use the survey or the srvyr package. You could write functions similar to those above to iterate over several variables. You could also write a helper function for gtsummary as above, since it currently has no built-in functionality for ratios.
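To make explicit what the point estimate of a weighted ratio is, here is a minimal sketch in base R. The toy data frame, its values and the column name weight are hypothetical and only for illustration. It reproduces the point estimate only (weighted total of deaths divided by weighted total of observation time); the design-based confidence intervals still require the survey or srvyr functions shown below.

```r
# toy data (hypothetical values, for illustration only)
toy <- data.frame(
  died    = c(TRUE, FALSE, FALSE, TRUE),  # event indicator
  obstime = c(100, 200, 150, 50),         # observation time
  weight  = c(10, 20, 10, 5)              # survey weight
)

# weighted total of deaths and weighted total of observation time
weighted_deaths  <- sum(toy$weight * toy$died)
weighted_obstime <- sum(toy$weight * toy$obstime)

# ratio, expressed per 10,000 units of observation time
mortality_ratio <- weighted_deaths / weighted_obstime * 10000
mortality_ratio
```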

26.9.1 Survey package

ratio <- svyratio(~died, 
         denominator = ~obstime, 
         design = base_survey_design)

ci <- confint(ratio)

cbind(
  ratio$ratio * 10000, 
  ci * 10000
)
##       obstime    2.5 %   97.5 %
## died 5.981922 1.194294 10.76955

26.9.2 Srvyr package

survey_design %>% 
  ## survey ratio used to account for observation time 
  summarise(
    mortality = survey_ratio(
      as.numeric(died) * 10000, 
      obstime, 
      vartype = "ci")
    )
##   mortality mortality_low mortality_upp
## 1  5.981922     0.3490176      11.61483

27 Survival analysis

27.1 Overview

Survival analysis describes, for a given individual or group of individuals, the time until a defined event called the failure (occurrence of a disease, cure from a disease, death, relapse after response to treatment…). The period during which individuals are observed before the event is called the failure time (or follow-up time in cohort/population-based studies). To determine the failure time, it is necessary to define a time of origin (for example the inclusion date or the date of diagnosis).

The target of inference for survival analysis is therefore the time between an origin and an event. In current medical research it is widely used, for instance in clinical studies to assess the effect of a treatment, or in cancer epidemiology to assess a large variety of cancer survival measures.

It is usually expressed through the survival probability, which is the probability that the event of interest has not occurred by a duration t.

Censoring: Censoring occurs when, at the end of follow-up, some of the individuals have not had the event of interest, and thus their true time to event is unknown. We will mostly focus on right censoring here; for more details on censoring and survival analysis in general, see the references.

27.2 Preparation

Load packages

To run survival analyses in R, one of the most widely used packages is the survival package. We first install it and then load it, together with the other packages used in this section.

In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

This page explores survival analyses using the linelist used in most of the previous pages, to which we apply some changes to obtain proper survival data.

Import dataset

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import linelist
linelist_case_data <- rio::import("linelist_cleaned.rds")

Data management and transformation

In short, survival data can be described as having the following three characteristics:

  1. the dependent variable or response is the waiting time until the occurrence of a well-defined event,
  2. observations are censored, in the sense that for some units the event of interest has not occurred at the time the data are analyzed, and
  3. there are predictors or explanatory variables whose effect on the waiting time we wish to assess or control.

Thus, we will create different variables needed to respect that structure and run the survival analysis.

We define:

  • a new data frame linelist_surv for this analysis
  • our event of interest as being “death” (hence our survival probability will be the probability of being alive after a certain time after the time of origin),
  • the follow-up time (futime) as the time between the time of onset and the time of outcome in days,
  • censored patients as those who recovered or for whom the final outcome is not known, i.e. the event “death” was not observed (event=0).

CAUTION: In a real cohort study the time of origin and the end of follow-up are known, because individuals are observed. We therefore remove observations where the date of onset or the date of outcome is unknown. Cases where the date of onset is later than the date of outcome are also removed, since they are considered to be errors.

TIP: Filtering with greater than (>) or less than (<) on a date column also removes rows where that date is missing, so the filter that removes the wrong dates removes the rows with missing dates as well.
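The behaviour described in the TIP can be checked on a small example. This is a sketch in base R with a hypothetical toy data frame; subset() is used here because, like dplyr::filter(), it keeps only the rows where the condition is TRUE and so drops rows where the comparison returns NA.

```r
# toy data (hypothetical): one valid row, one with onset after outcome,
# one with a missing onset date and one with a missing outcome date
toy <- data.frame(
  case_id      = c("a", "b", "c", "d"),
  date_onset   = as.Date(c("2014-05-01", "2014-05-10", NA, "2014-05-20")),
  date_outcome = as.Date(c("2014-05-06", "2014-05-08", "2014-05-15", NA))
)

# the comparison returns NA whenever either date is missing
comparison <- toy$date_outcome > toy$date_onset
comparison

# only rows where the comparison is TRUE are kept
toy_kept <- subset(toy, date_outcome > date_onset)
toy_kept
```

Only case "a" remains: case "b" is dropped because its onset is later than its outcome, and cases "c" and "d" are dropped because of their missing dates.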

We then use case_when() to create a column age_cat_small in which there are only 3 age categories.

# create a new data frame called linelist_surv from linelist_case_data

linelist_surv <-  linelist_case_data %>% 
     
  dplyr::filter(
       # remove observations with wrong or missing dates of onset or date of outcome
       date_outcome > date_onset) %>% 
  
  dplyr::mutate(
       # create the event var which is 1 if the patient died and 0 if the patient was right-censored
       event = ifelse(is.na(outcome) | outcome == "Recover", 0, 1), 
    
       # create the var on the follow-up time in days
       futime = as.double(date_outcome - date_onset), 
    
       # create a new age category variable with only 3 strata levels
       age_cat_small = dplyr::case_when( 
            age_years < 5  ~ "0-4",
            age_years >= 5 & age_years < 20 ~ "5-19",
            age_years >= 20   ~ "20+"),
       
       # previous step created age_cat_small var as character.
       # now convert it to factor and specify the levels.
       # Note that the NA values remain NA's and are not put in a level "unknown" for example,
       # since in the next analyses they have to be removed.
       age_cat_small = fct_relevel(age_cat_small, "0-4", "5-19", "20+")
       )

TIP: We can verify the new columns we have created by doing a summary on the futime and a cross-tabulation between event and outcome from which it was created. Besides this verification it is a good habit to communicate the median follow-up time when interpreting survival analysis results.

summary(linelist_surv$futime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   10.00   11.98   16.00   64.00
# cross tabulate the new event var and the outcome var from which it was created
# to make sure the code did what it was intended to
linelist_surv %>% 
  tabyl(outcome, event)
##  outcome    0    1
##    Death    0 1952
##  Recover 1547    0
##     <NA> 1040    0

Now we cross-tabulate the new age_cat_small var and the old age_cat col to ensure correct assignments.

linelist_surv %>% 
  tabyl(age_cat_small, age_cat)
##  age_cat_small 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ NA_
##            0-4 834   0     0     0     0     0     0   0   0
##           5-19   0 852   717   575     0     0     0   0   0
##            20+   0   0     0     0   862   554    69   5   0
##           <NA>   0   0     0     0     0     0     0   0  71

Now we review the first 10 observations of the linelist_surv data, looking at specific variables (including those newly created).

linelist_surv %>% 
  select(case_id, age_cat_small, date_onset, date_outcome, outcome, event, futime) %>% 
  head(10)
##    case_id age_cat_small date_onset date_outcome outcome event futime
## 1   8689b7           0-4 2014-05-13   2014-05-18 Recover     0      5
## 2   11f8ea           20+ 2014-05-16   2014-05-30 Recover     0     14
## 3   893f25           0-4 2014-05-21   2014-05-29 Recover     0      8
## 4   be99c8          5-19 2014-05-22   2014-05-24 Recover     0      2
## 5   07e3e8          5-19 2014-05-27   2014-06-01 Recover     0      5
## 6   369449           0-4 2014-06-02   2014-06-07   Death     1      5
## 7   f393b4           20+ 2014-06-05   2014-06-18 Recover     0     13
## 8   1389ca           20+ 2014-06-05   2014-06-09   Death     1      4
## 9   2978ac          5-19 2014-06-06   2014-06-15   Death     1      9
## 10  fc15ef          5-19 2014-06-16   2014-07-09 Recover     0     23

We can also cross-tabulate the columns age_cat_small and gender to have more details on the distribution of this new column by gender. We use tabyl() and the adorn functions from janitor as described in the Descriptive tables page.

linelist_surv %>% 
  tabyl(gender, age_cat_small, show_na = F) %>% 
  adorn_totals(where = "both") %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front")
##  gender         0-4         5-19          20+         Total
##       f 482 (22.4%) 1184 (54.9%)  490 (22.7%) 2156 (100.0%)
##       m 325 (15.0%)  880 (40.6%)  960 (44.3%) 2165 (100.0%)
##   Total 807 (18.7%) 2064 (47.8%) 1450 (33.6%) 4321 (100.0%)

27.3 Basics of survival analysis

Building a surv-type object

We will first use Surv() from survival to build a survival object from the follow-up time and event columns.

The result of this step is an object of type Surv that condenses the time information and whether the event of interest (death) was observed. This object will ultimately be used as the response, on the left-hand side of subsequent model formulae (see documentation).

# Use Surv() syntax for right-censored data
survobj <- Surv(time = linelist_surv$futime,
                event = linelist_surv$event)

To review, here are the first 10 rows of the linelist_surv data, viewing only some important columns.

linelist_surv %>% 
  select(case_id, date_onset, date_outcome, futime, outcome, event) %>% 
  head(10)
##    case_id date_onset date_outcome futime outcome event
## 1   8689b7 2014-05-13   2014-05-18      5 Recover     0
## 2   11f8ea 2014-05-16   2014-05-30     14 Recover     0
## 3   893f25 2014-05-21   2014-05-29      8 Recover     0
## 4   be99c8 2014-05-22   2014-05-24      2 Recover     0
## 5   07e3e8 2014-05-27   2014-06-01      5 Recover     0
## 6   369449 2014-06-02   2014-06-07      5   Death     1
## 7   f393b4 2014-06-05   2014-06-18     13 Recover     0
## 8   1389ca 2014-06-05   2014-06-09      4   Death     1
## 9   2978ac 2014-06-06   2014-06-15      9   Death     1
## 10  fc15ef 2014-06-16   2014-07-09     23 Recover     0

And here are the first 10 elements of survobj. It prints as essentially a vector of follow-up time, with “+” to represent if an observation was right-censored. See how the numbers align above and below.

# print the first 10 elements of the vector to see how it presents
head(survobj, 10)
##  [1]  5+ 14+  8+  2+  5+  5  13+  4   9  23+

Running initial analyses

We then start our analysis using the survfit() function to produce a survfit object, which by default computes the Kaplan-Meier (KM) estimate of the overall (marginal) survival curve. This estimate is a step function with jumps at the observed event times. The final survfit object contains one or more survival curves and is created using the Surv object as the response variable in the model formula.

NOTE: The Kaplan-Meier estimate is a nonparametric maximum likelihood estimate (MLE) of the survival function (see resources for more information).

The summary of this survfit object gives what is called a life table. For each time step of the follow-up (time) where an event happened (in ascending order), it shows:

  • the number of people who were at risk of developing the event (people who did not have the event yet nor were censored: n.risk)
  • those who did develop the event (n.event)
  • and from the above: the probability of not developing the event (probability of not dying, or of surviving past that specific time)
  • finally, the standard error and the confidence interval for that probability are derived and displayed
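The life table columns listed above are enough to rebuild the Kaplan-Meier estimate by hand: at each time, the survival probability is the previous one multiplied by (1 - n.event / n.risk). The following is a minimal sketch on a hypothetical toy dataset (not the linelist), comparing the hand calculation with survfit().

```r
library(survival)

# toy follow-up data (hypothetical): event = 1 is a death, 0 is censored
toy <- data.frame(
  futime = c(2, 3, 3, 5, 6, 8),
  event  = c(1, 1, 0, 1, 0, 1)
)

# distinct follow-up times in ascending order
times <- sort(unique(toy$futime))

# number still at risk at each time, and number of deaths at each time
n_risk  <- sapply(times, function(tt) sum(toy$futime >= tt))
n_event <- sapply(times, function(tt) sum(toy$futime == tt & toy$event == 1))

# Kaplan-Meier: running product of the conditional survival probabilities
km_by_hand <- cumprod(1 - n_event / n_risk)

# the same estimate from survfit()
toy_fit <- survfit(Surv(futime, event) ~ 1, data = toy)

# side-by-side comparison
data.frame(time = times, n_risk, n_event, km_by_hand, km_survfit = toy_fit$surv)
```

The two survival columns should agree. Note that the censored observation at time 3 still counts in the number at risk at that time, but not afterwards.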

We fit the KM estimates using a formula in which the previously created Surv object “survobj” is the response variable. “~ 1” specifies that we run the model for the overall survival.

# fit the KM estimates using a formula where the Surv object "survobj" is the response variable.
# "~ 1" signifies that we run the model for the overall survival  
linelistsurv_fit <-  survival::survfit(survobj ~ 1)

#print its summary for more details
summary(linelistsurv_fit)
## Call: survfit(formula = survobj ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1   4539      30    0.993 0.00120        0.991        0.996
##     2   4500      69    0.978 0.00217        0.974        0.982
##     3   4394     149    0.945 0.00340        0.938        0.952
##     4   4176     194    0.901 0.00447        0.892        0.910
##     5   3899     214    0.852 0.00535        0.841        0.862
##     6   3592     210    0.802 0.00604        0.790        0.814
##     7   3223     179    0.757 0.00656        0.745        0.770
##     8   2899     167    0.714 0.00700        0.700        0.728
##     9   2593     145    0.674 0.00735        0.660        0.688
##    10   2311     109    0.642 0.00761        0.627        0.657
##    11   2081     119    0.605 0.00788        0.590        0.621
##    12   1843      89    0.576 0.00809        0.560        0.592
##    13   1608      55    0.556 0.00823        0.540        0.573
##    14   1448      43    0.540 0.00837        0.524        0.556
##    15   1296      31    0.527 0.00848        0.511        0.544
##    16   1152      48    0.505 0.00870        0.488        0.522
##    17   1002      29    0.490 0.00886        0.473        0.508
##    18    898      21    0.479 0.00900        0.462        0.497
##    19    798       7    0.475 0.00906        0.457        0.493
##    20    705       4    0.472 0.00911        0.454        0.490
##    21    626      13    0.462 0.00932        0.444        0.481
##    22    546       8    0.455 0.00948        0.437        0.474
##    23    481       5    0.451 0.00962        0.432        0.470
##    24    436       4    0.447 0.00975        0.428        0.466
##    25    378       4    0.442 0.00993        0.423        0.462
##    26    336       3    0.438 0.01010        0.419        0.458
##    27    297       1    0.436 0.01017        0.417        0.457
##    29    235       1    0.435 0.01030        0.415        0.455
##    38     73       1    0.429 0.01175        0.406        0.452

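When using summary() we can add the times argument to specify the times at which we want to see the survival information. Note that n.event then counts the events that occurred since the previous listed time (for example, 656 = 30 + 69 + 149 + 194 + 214 deaths up to day 5).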

#print its summary at specific times
summary(linelistsurv_fit, times = c(5,10,20,30,60))
## Call: survfit(formula = survobj ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5   3899     656    0.852 0.00535        0.841        0.862
##    10   2311     810    0.642 0.00761        0.627        0.657
##    20    705     446    0.472 0.00911        0.454        0.490
##    30    210      39    0.435 0.01030        0.415        0.455
##    60      2       1    0.429 0.01175        0.406        0.452

We can also use the print() function. The print.rmean = TRUE argument is used to obtain the mean survival time and its standard error (se).

NOTE: The restricted mean survival time (RMST) is a survival measure increasingly used in cancer survival analysis. It is often defined as the area under the survival curve up to a restriction time T, the time until which patients are observed (more details in the Resources section).

# print linelistsurv_fit object with mean survival time and its se. 
print(linelistsurv_fit, print.rmean = TRUE)
## Call: survfit(formula = survobj ~ 1)
## 
##          n     events     *rmean *se(rmean)     median    0.95LCL    0.95UCL 
##   4539.000   1952.000     33.105      0.539     17.000     16.000     18.000 
##     * restricted mean with upper limit =  64
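To illustrate the definition of the restricted mean as an area under the survival curve, here is a minimal sketch on a hypothetical toy dataset (not the linelist). It assumes the restriction time tau is at least as large as the longest follow-up time, and sums the rectangles under the KM step function.

```r
library(survival)

# toy follow-up data (hypothetical): event = 1 is a death, 0 is censored
toy <- data.frame(
  futime = c(2, 3, 3, 5, 6, 8),
  event  = c(1, 1, 0, 1, 0, 1)
)
toy_fit <- survfit(Surv(futime, event) ~ 1, data = toy)

# restriction time (assumed here to be the longest follow-up time)
tau <- 8

# the KM curve is a step function: it equals 1 until the first time point,
# then keeps the value toy_fit$surv[i] from time i until the next time point
step_start  <- c(0, toy_fit$time)
step_height <- c(1, toy_fit$surv)
step_width  <- diff(c(step_start, tau))

# area under the curve = sum of the rectangle areas
rmst_by_hand <- sum(step_height * step_width)
rmst_by_hand

# for comparison, the restricted mean printed by the survival package
print(toy_fit, print.rmean = TRUE, rmean = tau)
```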

TIP: We can create the surv object directly in the survfit() function and save a line of code. This will then look like: linelistsurv_quick <- survfit(Surv(futime, event) ~ 1, data=linelist_surv).

Cumulative hazard

Besides the summary() function, we can also use the str() function, which gives more details on the structure of the survfit object. It is a list of 16 elements.

Among these elements is an important one: cumhaz, which is a numeric vector. It can be plotted to show the cumulative hazard, the hazard being the instantaneous rate of event occurrence (see references).

str(linelistsurv_fit)
## List of 16
##  $ n        : int 4539
##  $ time     : num [1:59] 1 2 3 4 5 6 7 8 9 10 ...
##  $ n.risk   : num [1:59] 4539 4500 4394 4176 3899 ...
##  $ n.event  : num [1:59] 30 69 149 194 214 210 179 167 145 109 ...
##  $ n.censor : num [1:59] 9 37 69 83 93 159 145 139 137 121 ...
##  $ surv     : num [1:59] 0.993 0.978 0.945 0.901 0.852 ...
##  $ std.err  : num [1:59] 0.00121 0.00222 0.00359 0.00496 0.00628 ...
##  $ cumhaz   : num [1:59] 0.00661 0.02194 0.05585 0.10231 0.15719 ...
##  $ std.chaz : num [1:59] 0.00121 0.00221 0.00355 0.00487 0.00615 ...
##  $ type     : chr "right"
##  $ logse    : logi TRUE
##  $ conf.int : num 0.95
##  $ conf.type: chr "log"
##  $ lower    : num [1:59] 0.991 0.974 0.938 0.892 0.841 ...
##  $ upper    : num [1:59] 0.996 0.982 0.952 0.91 0.862 ...
##  $ call     : language survfit(formula = survobj ~ 1)
##  - attr(*, "class")= chr "survfit"
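As a sketch of what cumhaz contains, the example below uses a hypothetical toy dataset (not the linelist) and compares the cumhaz element with a running sum of n.event / n.risk, which is the Nelson-Aalen estimate that survfit() uses by default. The result is collected in a data frame so that it can be inspected or plotted.

```r
library(survival)

# toy follow-up data (hypothetical): event = 1 is a death, 0 is censored
toy <- data.frame(
  futime = c(2, 3, 3, 5, 6, 8),
  event  = c(1, 1, 0, 1, 0, 1)
)
toy_fit <- survfit(Surv(futime, event) ~ 1, data = toy)

# running sum of (deaths / number at risk) at each time point
cumhaz_by_hand <- cumsum(toy_fit$n.event / toy_fit$n.risk)

# collect the package value and the hand calculation side by side
cumhaz_table <- data.frame(
  time           = toy_fit$time,
  cumhaz         = toy_fit$cumhaz,
  cumhaz_by_hand = cumhaz_by_hand
)
cumhaz_table
```

The same first two values can be checked in the output above: 30 / 4539 = 0.00661, and adding 69 / 4500 gives 0.02194. To draw the curve directly, plot(toy_fit, fun = "cumhaz") can be used.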

Plotting Kaplan-Meier curves

Once the KM estimates are fitted, we can visualize the probability of being alive through a given time using the basic plot() function that draws the “Kaplan-Meier curve”. In other words, the curve below is a conventional illustration of the survival experience in the whole patient group.

We can quickly verify the follow-up time min and max on the curve.

An easy way to interpret is to say that at time zero, all the participants are still alive and survival probability is then 100%. This probability decreases over time as patients die. The proportion of participants surviving past 60 days of follow-up is around 40%.

plot(linelistsurv_fit, 
     xlab = "Days of follow-up",    # x-axis label
     ylab="Survival Probability",   # y-axis label
     main= "Overall survival curve" # figure title
     )

The confidence interval of the KM survival estimates is also plotted by default and can be removed by adding the option conf.int = FALSE to the plot() command.

Since the event of interest is “death”, drawing a curve describing the complements of the survival proportions will lead to drawing the cumulative mortality proportions. This can be done with lines(), which adds information to an existing plot.

# original plot
plot(
  linelistsurv_fit,
  xlab = "Days of follow-up",       
  ylab = "Survival Probability",       
  mark.time = TRUE,              # mark censoring times on the curve: a "+" is printed at each censoring time
  conf.int = FALSE,              # do not plot the confidence interval
  main = "Overall survival curve and cumulative mortality"
  )

# draw an additional curve to the previous plot
lines(
  linelistsurv_fit,
  lty = 3,             # use different line type for clarity
  fun = "event",       # draw the cumulative events instead of the survival 
  mark.time = FALSE,
  conf.int = FALSE
  )

# add a legend to the plot
legend(
  "topright",                               # position of legend
  legend = c("Survival", "Cum. Mortality"), # legend text 
  lty = c(1, 3),                            # line types to use in the legend
  cex = .85,                                # parameter that defines size of legend text
  bty = "n"                                 # no box type to be drawn for the legend
  )

27.4 Comparison of survival curves

To compare the survival within different groups of our observed participants or patients, we might need to first look at their respective survival curves and then run tests to evaluate the difference between independent groups. This comparison can concern groups based on gender, age, treatment, comorbidity…

Log rank test

The log rank test is a popular test that compares the entire survival experience between two or more independent groups. It can be thought of as a test of whether the survival curves are identical (overlapping) or not, the null hypothesis being no difference in survival between the groups. The survdiff() function of the survival package runs the log-rank test when we specify rho = 0 (which is the default). The test result gives a chi-square statistic along with a p-value, since the log rank statistic is approximately distributed as a chi-square test statistic.

We first try to compare the survival curves by gender group. For this, we first try to visualize it (check whether the two survival curves are overlapping). A new survfit object will be created with a slightly different formula. Then the survdiff object will be created.

By supplying ~ gender as the right side of the formula, we no longer estimate the overall survival but instead the survival by gender.

# create the new survfit object based on gender
linelistsurv_fit_sex <-  survfit(Surv(futime, event) ~ gender, data = linelist_surv)

Now we can plot the survival curves by gender. Have a look at the order of the strata levels in the gender column before defining your colors and legend.

# set colors
col_sex <- c("lightgreen", "darkgreen")

# create plot
plot(
  linelistsurv_fit_sex,
  col = col_sex,
  xlab = "Days of follow-up",
  ylab = "Survival Probability")

# add legend
legend(
  "topright",
  legend = c("Female","Male"),
  col = col_sex,
  lty = 1,
  cex = .9,
  bty = "n")

And now we can compute the test of the difference between the survival curves using survdiff()

#compute the test of the difference between the survival curves
survival::survdiff(
  Surv(futime, event) ~ gender, 
  data = linelist_surv
  )
## Call:
## survival::survdiff(formula = Surv(futime, event) ~ gender, data = linelist_surv)
## 
## n=4321, 218 observations deleted due to missingness.
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## gender=f 2156      924      909     0.255     0.524
## gender=m 2165      929      944     0.245     0.524
## 
##  Chisq= 0.5  on 1 degrees of freedom, p= 0.5

We see that the survival curve for women and the one for men overlap and the log-rank test does not give evidence of a survival difference between women and men.
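The p-value printed by survdiff() is rounded. If the exact value is needed, for example to report it in a table, it can be recomputed from the chi-square statistic stored in the returned object. The sketch below uses the lung dataset shipped with the survival package rather than the linelist (an assumption made so that the example is self-contained); the object names are hypothetical.

```r
library(survival)

# lung: example dataset shipped with the survival package
# (status: 1 = censored, 2 = dead; sex: 1 = male, 2 = female)
lr_test <- survdiff(Surv(time, status) ~ sex, data = lung)

# chi-square statistic and its degrees of freedom (number of groups - 1)
lr_chisq <- lr_test$chisq
lr_df    <- length(lr_test$n) - 1

# p-value from the upper tail of the chi-square distribution
lr_pvalue <- pchisq(lr_chisq, df = lr_df, lower.tail = FALSE)
lr_pvalue
```

The same approach should apply to the gender comparison above by replacing the formula and data with Surv(futime, event) ~ gender and linelist_surv.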

Some other R packages allow illustrating survival curves for different groups and testing the difference all at once. Using the ggsurvplot() function from the survminer package, we can also include in our curve the printed risk tables for each group, as well as the p-value from the log-rank test.

CAUTION: survminer functions require that you specify the survival object and again specify the data used to fit the survival object. Remember to do this to avoid non-specific error messages.

survminer::ggsurvplot(
    linelistsurv_fit_sex, 
    data = linelist_surv,          # again specify the data used to fit linelistsurv_fit_sex 
    conf.int = FALSE,              # do not show confidence interval of KM estimates
    surv.scale = "percent",        # present probabilities in the y axis in %
    break.time.by = 10,            # present the time axis with an increment of 10 days
    xlab = "Follow-up days",
    ylab = "Survival Probability",
    pval = T,                      # print p-value of Log-rank test 
    pval.coord = c(40,.91),        # print p-value at these plot coordinates
    risk.table = T,                # print the risk table at bottom 
    legend.title = "Gender",       # legend characteristics
    legend.labs = c("Female","Male"),
    font.legend = 10, 
    palette = "Dark2",             # specify color palette 
    surv.median.line = "hv",       # draw horizontal and vertical lines to the median survivals
    ggtheme = theme_light()        # simplify plot background
)

We may also want to test for differences in survival by the source of infection (source of contamination).

In this case, the Log rank test gives enough evidence of a difference in the survival probabilities at alpha= 0.005. The survival probabilities for patients that were infected at funerals are higher than the survival probabilities for patients that got infected in other places, suggesting a survival benefit.

linelistsurv_fit_source <-  survfit(
  Surv(futime, event) ~ source,
  data = linelist_surv
  )

# plot
ggsurvplot( 
  linelistsurv_fit_source,
  data = linelist_surv,
  size = 1, linetype = "strata",   # line types
  conf.int = T,
  surv.scale = "percent",  
  break.time.by = 10, 
  xlab = "Follow-up days",
  ylab= "Survival Probability",
  pval = T,
  pval.coord = c(40,.91),
  risk.table = T,
  legend.title = "Source of \ninfection",
  legend.labs = c("Funeral", "Other"),
  font.legend = 10,
  palette = c("#E7B800","#3E606F"),
  surv.median.line = "hv", 
  ggtheme = theme_light()
)

27.5 Cox regression analysis

Cox proportional hazards regression is one of the most popular regression techniques for survival analysis. Other models can also be used, since the Cox model relies on important assumptions, such as the proportional hazards assumption, that need to be verified for appropriate use: see references.

In a Cox proportional hazards regression model, the quantity being modelled is the hazard rate, which is the risk of failure (or the risk of death in our example), given that the participant has survived up to a specific time. Usually, we are interested in comparing independent groups with respect to their hazards, and the measure of effect is the hazard ratio (HR), which is analogous to an odds ratio in the setting of multiple logistic regression analysis. The coxph() function from the survival package is used to fit the model. The function cox.zph() from the survival package may be used to test the proportional hazards assumption for a Cox regression model fit.
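As a minimal sketch of how these two functions fit together, the example below fits a Cox model and then tests the proportional hazards assumption. It uses the lung dataset shipped with the survival package rather than the linelist (an assumption made so that the example is self-contained), and the object names are hypothetical.

```r
library(survival)

# lung: example dataset shipped with the survival package
# (status: 1 = censored, 2 = dead)
cox_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)

# test the proportional hazards assumption for the fitted model
ph_test <- cox.zph(cox_fit)

# one row per covariate plus a GLOBAL row for the model as a whole
ph_test$table
```

A small p-value for a covariate (or for the GLOBAL row) is usually read as evidence against proportional hazards for that covariate (or for the model overall).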

NOTE: A probability must lie in the range 0 to 1. However, the hazard represents the expected number of events per one unit of time.

  • If the hazard ratio for a predictor is close to 1 then that predictor does not affect survival,
  • if the HR is less than 1, then the predictor is protective (i.e., associated with improved survival),
  • and if the HR is greater than 1, then the predictor is associated with increased risk (or decreased survival).

Fitting a Cox model

We can first fit a model to assess the effect of age and gender on the survival. By just printing the model, we have the information on:

  • the estimated regression coefficients coef which quantifies the association between the predictors and the outcome,
  • their exponential (for interpretability, exp(coef)) which produces the hazard ratio,
  • their standard error se(coef),
  • the z-score: how many standard errors is the estimated coefficient away from 0,
  • and the p-value: the probability of observing a z-score at least this extreme if the true coefficient were 0.

The summary() function applied to the cox model object gives more information, such as the confidence interval of the estimated HR and the different test scores.

The effect of the first covariate, gender, is presented in the first row. genderm (male) is printed, implying that the first factor level (“f”), i.e. the female group, is the reference group for gender. Thus the test parameter is interpreted as that of men compared to women. The p-value indicates there was not enough evidence of an effect of gender on the expected hazard, or of an association between gender and all-cause mortality.

The same lack of evidence is noted regarding age-group.

#fitting the cox model
linelistsurv_cox_sexage <-  survival::coxph(
              Surv(futime, event) ~ gender + age_cat_small, 
              data = linelist_surv
              )


#printing the model fitted
linelistsurv_cox_sexage
## Call:
## survival::coxph(formula = Surv(futime, event) ~ gender + age_cat_small, 
##     data = linelist_surv)
## 
##                       coef exp(coef) se(coef)      z     p
## genderm           -0.03149   0.96900  0.04767 -0.661 0.509
## age_cat_small5-19  0.09400   1.09856  0.06454  1.456 0.145
## age_cat_small20+   0.05032   1.05161  0.06953  0.724 0.469
## 
## Likelihood ratio test=2.8  on 3 df, p=0.4243
## n= 4321, number of events= 1853 
##    (218 observations deleted due to missingness)
#summary of the model
summary(linelistsurv_cox_sexage)
## Call:
## survival::coxph(formula = Surv(futime, event) ~ gender + age_cat_small, 
##     data = linelist_surv)
## 
##   n= 4321, number of events= 1853 
##    (218 observations deleted due to missingness)
## 
##                       coef exp(coef) se(coef)      z Pr(>|z|)
## genderm           -0.03149   0.96900  0.04767 -0.661    0.509
## age_cat_small5-19  0.09400   1.09856  0.06454  1.456    0.145
## age_cat_small20+   0.05032   1.05161  0.06953  0.724    0.469
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## genderm               0.969     1.0320    0.8826     1.064
## age_cat_small5-19     1.099     0.9103    0.9680     1.247
## age_cat_small20+      1.052     0.9509    0.9176     1.205
## 
## Concordance= 0.514  (se = 0.007 )
## Likelihood ratio test= 2.8  on 3 df,   p=0.4
## Wald test            = 2.78  on 3 df,   p=0.4
## Score (logrank) test = 2.78  on 3 df,   p=0.4
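
As a quick check on how the printed columns relate to each other, the sketch below recomputes the z-score and p-value from the coef and se(coef) values shown in the output above. It uses only base R; the numbers are copied by hand from the printed model, so small rounding differences are expected.

```r
# coef and se(coef) copied from the printed model output above
coef_est <- c(genderm = -0.03149, `age_cat_small5-19` = 0.09400, `age_cat_small20+` = 0.05032)
se_est   <- c(0.04767, 0.06454, 0.06953)

# z-score: how many standard errors the coefficient is away from 0
z <- coef_est / se_est

# two-sided p-value from the standard normal distribution
p <- 2 * pnorm(-abs(z))

# hazard ratio: exponentiated coefficient
hr <- exp(coef_est)

round(cbind(coef = coef_est, HR = hr, z = z, p = p), 3)
```

The rounded values should agree with the z, p, and exp(coef) columns printed by the model.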

It was interesting to run the model and look at the results, but first checking whether the proportional hazards assumption holds could save time.

test_ph_sexage <- survival::cox.zph(linelistsurv_cox_sexage)
test_ph_sexage
##               chisq df    p
## gender        0.454  1 0.50
## age_cat_small 0.838  2 0.66
## GLOBAL        1.399  3 0.71
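
If you want to work with these test results in code rather than reading them off the console, a minimal sketch follows. It assumes the lung dataset that ships with the survival package (not the handbook linelist) and uses the table element of the cox.zph object; the 0.05 threshold is only an illustrative choice.

```r
library(survival)

# fit a small Cox model on the example lung data shipped with survival
fit_demo <- coxph(Surv(time, status) ~ age + sex, data = lung)

# test the proportional hazards assumption
zph_demo <- cox.zph(fit_demo)

# the test results are stored as a matrix with columns chisq, df and p
zph_demo$table

# names of the terms (and GLOBAL) with p below an illustrative 0.05 threshold
flagged <- rownames(zph_demo$table)[zph_demo$table[, "p"] < 0.05]
flagged
```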

NOTE: An argument called ties (also accepted under its older name, method) can be specified when fitting the Cox model to determine how ties are handled. The default is “efron”, and the other options are “breslow” and “exact”.
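
As a minimal sketch of changing the tie-handling option, the example below fits the same model twice on the lung dataset shipped with survival (an assumption for illustration, not the handbook data). In current versions of survival the option is passed through the ties argument of coxph().

```r
library(survival)

# default tie handling (Efron approximation)
fit_efron <- coxph(Surv(time, status) ~ sex, data = lung)

# Breslow approximation instead
fit_breslow <- coxph(Surv(time, status) ~ sex, data = lung, ties = "breslow")

# the fitted object records which method was used
fit_efron$method
fit_breslow$method

# the estimates are usually very close unless there are many tied event times
c(efron = coef(fit_efron)[["sex"]], breslow = coef(fit_breslow)[["sex"]])
```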

In another model we add more risk factors such as the source of infection and the number of days between date of onset and admission. This time, we first verify the proportional hazards assumption before going forward.

In this model, we have included a continuous predictor (days_onset_hosp). In this case we interpret the parameter estimates as the increase in the expected log of the relative hazard for each one unit increase in the predictor, holding other predictors constant.

#fit the model
linelistsurv_cox <-  coxph(
                        Surv(futime, event) ~ gender + age_years+ source + days_onset_hosp,
                        data = linelist_surv
                        )


#test the proportional hazard model
linelistsurv_ph_test <- cox.zph(linelistsurv_cox)
linelistsurv_ph_test
##                    chisq df       p
## gender           0.45062  1    0.50
## age_years        0.00199  1    0.96
## source           1.79622  1    0.18
## days_onset_hosp 31.66167  1 1.8e-08
## GLOBAL          34.08502  4 7.2e-07

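In the test output above, the small p-values for days_onset_hosp and for the global test suggest that the proportional hazards assumption may not hold for this covariate, so its estimate should be interpreted with caution. The graphical verification of this assumption may be performed with the function ggcoxzph() from the survminer package.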

survminer::ggcoxzph(linelistsurv_ph_test)

The model results indicate there is a negative association between onset to admission duration and all-cause mortality. The expected hazard for a person admitted one day later than another is 0.9 times as high, holding the other covariates constant. Put more plainly, a one-day increase in the time from onset to admission is associated with a 9.9% ((1 - exp(coef)) * 100) decrease in the hazard of death.

Results also show a positive association between the source of infection and all-cause mortality: the hazard of death is about 1.2 times higher for patients whose source of infection was something other than a funeral.

#print the summary of the model
summary(linelistsurv_cox)
## Call:
## coxph(formula = Surv(futime, event) ~ gender + age_years + source + 
##     days_onset_hosp, data = linelist_surv)
## 
##   n= 2772, number of events= 1180 
##    (1767 observations deleted due to missingness)
## 
##                      coef exp(coef)  se(coef)      z Pr(>|z|)    
## genderm          0.004710  1.004721  0.060827  0.077   0.9383    
## age_years       -0.002249  0.997753  0.002421 -0.929   0.3528    
## sourceother      0.178393  1.195295  0.084291  2.116   0.0343 *  
## days_onset_hosp -0.104063  0.901169  0.014245 -7.305 2.77e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                 exp(coef) exp(-coef) lower .95 upper .95
## genderm            1.0047     0.9953    0.8918    1.1319
## age_years          0.9978     1.0023    0.9930    1.0025
## sourceother        1.1953     0.8366    1.0133    1.4100
## days_onset_hosp    0.9012     1.1097    0.8764    0.9267
## 
## Concordance= 0.566  (se = 0.009 )
## Likelihood ratio test= 71.31  on 4 df,   p=1e-14
## Wald test            = 59.22  on 4 df,   p=4e-12
## Score (logrank) test = 59.54  on 4 df,   p=4e-12
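
The hazard ratios, confidence intervals, and percent changes can be reproduced from the coef and se(coef) columns of this summary. The sketch below does so in base R with values copied by hand from the output above, so expect small rounding differences.

```r
# coef and se(coef) copied from the summary output above
coef_est <- c(sourceother = 0.178393, days_onset_hosp = -0.104063)
se_est   <- c(sourceother = 0.084291, days_onset_hosp = 0.014245)

# hazard ratio and its 95% confidence interval (Wald interval on the log scale)
hr    <- exp(coef_est)
lower <- exp(coef_est - qnorm(0.975) * se_est)
upper <- exp(coef_est + qnorm(0.975) * se_est)

# percent change in the hazard per one-unit increase in the predictor
pct_change <- (hr - 1) * 100

round(cbind(HR = hr, lower = lower, upper = upper, pct_change = pct_change), 3)
```

The percent change for days_onset_hosp comes out slightly smaller in magnitude than coef * 100, because exp(coef) - 1 is only approximately equal to coef when coef is close to 0.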

We can verify this relationship with a table:

linelist_case_data %>% 
  tabyl(days_onset_hosp, outcome) %>% 
  adorn_percentages() %>%  
  adorn_pct_formatting()
##  days_onset_hosp Death Recover   NA_
##                0 44.3%   31.4% 24.3%
##                1 46.6%   32.2% 21.2%
##                2 43.0%   32.8% 24.2%
##                3 45.0%   32.3% 22.7%
##                4 41.5%   38.3% 20.2%
##                5 40.0%   36.2% 23.8%
##                6 32.2%   48.7% 19.1%
##                7 31.8%   38.6% 29.5%
##                8 29.8%   38.6% 31.6%
##                9 30.3%   51.5% 18.2%
##               10 16.7%   58.3% 25.0%
##               11 36.4%   45.5% 18.2%
##               12 18.8%   62.5% 18.8%
##               13 10.0%   60.0% 30.0%
##               14 10.0%   50.0% 40.0%
##               15 28.6%   42.9% 28.6%
##               16 20.0%   80.0%  0.0%
##               17  0.0%  100.0%  0.0%
##               18  0.0%  100.0%  0.0%
##               22  0.0%  100.0%  0.0%
##               NA 52.7%   31.2% 16.0%

We would need to consider and investigate why this association exists in the data. One possible explanation could be that patients who live long enough to be admitted later had less severe disease to begin with. Another perhaps more likely explanation is that since we used a simulated fake dataset, this pattern does not reflect reality!

Forest plots

We can then visualize the results of the Cox model as a forest plot with the ggforest() function of the survminer package.

ggforest(linelistsurv_cox, data = linelist_surv)

27.6 Time-dependent covariates in survival models

Some of the following sections have been adapted with permission from an excellent introduction to survival analysis in R by Dr. Emily Zabor.

In the last section we covered using Cox regression to examine associations between covariates of interest and survival outcomes. But these analyses rely on the covariate being measured at baseline, that is, before follow-up time for the event begins.

What happens if you are interested in a covariate that is measured after follow-up time begins? Or, what if you have a covariate that can change over time?

For example, maybe you are working with clinical data where you have repeated measures of hospital laboratory values that can change over time. This is an example of a Time Dependent Covariate. In order to address this you need a special setup, but fortunately the Cox model is very flexible and this type of data can also be modeled with tools from the survival package.

Time-dependent covariate setup

Analysis of time-dependent covariates in R requires setup of a special dataset. If interested, see the more detailed paper on this by the author of the survival package Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model.

For this, we’ll use a new dataset from the SemiCompRisks package named BMT, which includes data on 137 bone marrow transplant patients. The variables we’ll focus on are:

  • T1 - time (in days) to death or last follow-up
  • delta1 - death indicator; 1-Dead, 0-Alive
  • TA - time (in days) to acute graft-versus-host disease
  • deltaA - acute graft-versus-host disease indicator;
    • 1 - Developed acute graft-versus-host disease
    • 0 - Never developed acute graft-versus-host disease

We’ll load this dataset from the SemiCompRisks package using the base R command data(), which can be used for loading data that is included in an installed R package. The data frame BMT will appear in your R environment.

data(BMT, package = "SemiCompRisks")

Add unique patient identifier

There is no unique ID column in the BMT data, which is needed to create the type of dataset we want. So we use the function rowid_to_column() from the tidyverse package tibble to create a new id column called my_id (adds column at start of data frame with sequential row ids, starting at 1). We name the data frame bmt.

bmt <- rowid_to_column(BMT, "my_id")

Expand patient rows

Next, we’ll use the tmerge() function with the event() and tdc() helper functions to create the restructured dataset. Our goal is to restructure the dataset to create a separate row for each patient for each time interval where they have a different value for deltaA. In this case, each patient can have at most two rows depending on whether they developed acute graft-versus-host disease during the data collection period. We’ll call our new indicator for the development of acute graft-versus-host disease agvhd.

  • tmerge() creates a long dataset with multiple time intervals for the different covariate values for each patient
  • event() creates the new event indicator to go with the newly-created time intervals
  • tdc() creates the time-dependent covariate column, agvhd, to go with the newly created time intervals

td_dat <- 
  tmerge(
    data1 = bmt %>% select(my_id, T1, delta1), 
    data2 = bmt %>% select(my_id, T1, delta1, TA, deltaA), 
    id = my_id, 
    death = event(T1, delta1),
    agvhd = tdc(TA)
    )

To see what this does, let’s look at the data for the first 5 individual patients.

The variables of interest in the original data looked like this:

bmt %>% 
  select(my_id, T1, delta1, TA, deltaA) %>% 
  filter(my_id %in% seq(1, 5))
##   my_id   T1 delta1   TA deltaA
## 1     1 2081      0   67      1
## 2     2 1602      0 1602      0
## 3     3 1496      0 1496      0
## 4     4 1462      0   70      1
## 5     5 1433      0 1433      0

The new dataset for these same patients looks like this:

td_dat %>% 
  filter(my_id %in% seq(1, 5))
##   my_id   T1 delta1 tstart tstop death agvhd
## 1     1 2081      0      0    67     0     0
## 2     1 2081      0     67  2081     0     1
## 3     2 1602      0      0  1602     0     0
## 4     3 1496      0      0  1496     0     0
## 5     4 1462      0      0    70     0     0
## 6     4 1462      0     70  1462     0     1
## 7     5 1433      0      0  1433     0     0

Now some of our patients have two rows in the dataset, corresponding to intervals where they have a different value of our new variable, agvhd. For example, Patient 1 now has two rows: one with an agvhd value of zero from time 0 to time 67, and one with a value of 1 from time 67 to time 2081.
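
To see the restructuring on a dataset small enough to check by hand, here is a minimal sketch with three made-up patients. The data frame toy and its values are hypothetical; the column names mirror those used above. As in the BMT data, a patient who never developed the condition has TA equal to T1.

```r
library(survival)

# three hypothetical patients: follow-up time, death indicator, time to aGVHD
toy <- data.frame(
  my_id  = 1:3,
  T1     = c(100, 80, 60),   # days to death or last follow-up
  delta1 = c(0, 1, 1),       # 1 = died, 0 = alive at last follow-up
  TA     = c(30, 80, 20)     # patient 2 never developed aGVHD (TA equals T1)
)

toy_td <- tmerge(
  data1 = toy[, c("my_id", "T1", "delta1")],
  data2 = toy,
  id    = my_id,
  death = event(T1, delta1),  # event indicator at the end of follow-up
  agvhd = tdc(TA)             # switches from 0 to 1 at time TA
)

toy_td
```

Patients 1 and 3 are split into two intervals at their TA, while patient 2 keeps a single interval, and each death is recorded only on the patient's final interval.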

Cox regression with time-dependent covariates

Now that we’ve reshaped our data and added the new time-dependent agvhd variable, let’s fit a simple single variable Cox regression model. We can use the same coxph() function as before; we just need to change our Surv() function to specify both the start and stop time for each interval using the time = and time2 = arguments.

bmt_td_model = coxph(
  Surv(time = tstart, time2 = tstop, event = death) ~ agvhd, 
  data = td_dat
  )

summary(bmt_td_model)
## Call:
## coxph(formula = Surv(time = tstart, time2 = tstop, event = death) ~ 
##     agvhd, data = td_dat)
## 
##   n= 163, number of events= 80 
## 
##         coef exp(coef) se(coef)    z Pr(>|z|)
## agvhd 0.3351    1.3980   0.2815 1.19    0.234
## 
##       exp(coef) exp(-coef) lower .95 upper .95
## agvhd     1.398     0.7153    0.8052     2.427
## 
## Concordance= 0.535  (se = 0.024 )
## Likelihood ratio test= 1.33  on 1 df,   p=0.2
## Wald test            = 1.42  on 1 df,   p=0.2
## Score (logrank) test = 1.43  on 1 df,   p=0.2
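
The start-stop form of Surv() used in this model can also be inspected on its own. The sketch below builds the two intervals shown earlier for Patient 1 (values copied from the restructured data) and checks that survival treats them as counting-process intervals.

```r
library(survival)

# Patient 1's two intervals: (0, 67] without aGVHD, (67, 2081] with aGVHD, no death
intervals <- Surv(time = c(0, 67), time2 = c(67, 2081), event = c(0, 0))

intervals                  # printed as (start, stop] intervals
attr(intervals, "type")    # "counting" when both start and stop times are given
```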

Again, we’ll visualize our Cox model results using the ggforest() function from the survminer package:

ggforest(bmt_td_model, data = td_dat)

As you can see from the forest plot, confidence interval, and p-value, there does not appear to be a strong association between death and acute graft-versus-host disease in the context of our simple model.

28 GIS basics

28.1 Overview

Spatial aspects of your data can provide a lot of insight into the situation of the outbreak, and help answer questions such as:

  • Where are the current disease hotspots?
  • How have the hotspots changed over time?
  • How is the access to health facilities? Are any improvements needed?

The current focus of this GIS page is to address the needs of applied epidemiologists in outbreak response. We will explore basic spatial data visualization methods using the tmap and ggplot2 packages. We will also walk through some of the basic spatial data management and querying methods with the sf package. Lastly, we will briefly touch upon concepts of spatial statistics such as spatial relationships, spatial autocorrelation, and spatial regression using the spdep package.

28.2 Key terms

Below we introduce some key terminology. For a thorough introduction to GIS and spatial analysis, we suggest that you review one of the longer tutorials or courses listed in the References section.

Geographic Information System (GIS) - A GIS is a framework or environment for gathering, managing, analyzing, and visualizing spatial data.

GIS software

Some popular GIS software allows point-and-click interaction for map development and spatial analysis. These tools come with advantages such as not needing to learn code and the ease of manually selecting and placing icons and features on a map. Here are two popular ones:

ArcGIS - A commercial GIS software developed by the company ESRI, which is very popular but quite expensive

QGIS - A free open-source GIS software that can do almost anything that ArcGIS can do. You can download QGIS here

Using R as a GIS can seem more intimidating at first because instead of “point-and-click”, it has a “command-line interface” (you must code to acquire the desired outcome). However, this is a major advantage if you need to repetitively produce maps or create an analysis that is reproducible.

Spatial data

The two primary forms of spatial data used in GIS are vector and raster data:

Vector Data - The most common format of spatial data used in GIS, vector data are comprised of geometric features of vertices and paths. Vector spatial data can be further divided into three widely-used types:

  • Points - A point consists of a coordinate pair (x,y) representing a specific location in a coordinate system. Points are the most basic form of spatial data, and may be used to denote a case (e.g. patient home) or a location (e.g. hospital) on a map.

  • Lines - A line is composed of at least two connected points. Lines have a length, and may be used to denote things like roads or rivers.

  • Polygons - A polygon is composed of at least three line segments connected by points. Polygon features have a length (i.e. the perimeter of the area) as well as an area measurement. Polygons may be used to denote an area (e.g. a village) or a structure (e.g. the actual area of a hospital).
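
As a minimal sketch of how these three vector types look in R, the example below builds one of each with the sf package. The coordinates are arbitrary values on a flat plane with no coordinate reference system, chosen only so that the length and area are easy to check by hand.

```r
library(sf)

# a point: a single coordinate pair
pt <- st_point(c(1, 2))

# a line: two or more connected points (here from (0,0) to (3,4))
ln <- st_linestring(rbind(c(0, 0), c(3, 4)))

# a polygon: a closed ring (the first and last coordinates must be identical)
pg <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))

# combine into a geometry column and inspect the types
geoms <- st_sfc(pt, ln, pg)
st_geometry_type(geoms)

# lines have a length; polygons have an area
st_length(st_sfc(ln))   # 5
st_area(st_sfc(pg))     # 0.5
```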

Raster Data - An alternative format for spatial data, raster data is a matrix of cells (e.g. pixels) with each cell containing information such as height, temperature, slope, forest cover, etc. These are often aerial photographs, satellite imagery, etc. Rasters can also be used as “base maps” below vector data.

Visualizing spatial data

To visually represent spatial data on a map, GIS software requires you to provide sufficient information about where different features should be, in relation to one another. If you are using vector data, which will be true for most use cases, this information will typically be stored in a shapefile:

Shapefiles - A shapefile is a common data format for storing “vector” spatial data consisting of lines, points, or polygons. A single shapefile is actually a collection of at least three files - .shp, .shx, and .dbf. All of these sub-component files must be present in a given directory (folder) for the shapefile to be readable. These associated files can be compressed into a ZIP folder to be sent via email or downloaded from a website.

The shapefile will contain information about the features themselves, as well as where to locate them on the Earth’s surface. This is important because while the Earth is a globe, maps are typically two-dimensional; choices about how to “flatten” spatial data can have a big impact on the look and interpretation of the resulting map.

Coordinate Reference Systems (CRS) - A CRS is a coordinate-based system used to locate geographical features on the Earth’s surface. It has a few key components:

  • Coordinate System - There are many different coordinate systems, so make sure you know which system your coordinates are from. Degrees of latitude/longitude are common, but you could also see UTM coordinates.

  • Units - Know what the units are for your coordinate system (e.g. decimal degrees, meters)

  • Datum - A particular modeled version of the Earth. These have been revised over the years, so ensure that your map layers are using the same datum.

  • Projection - A reference to the mathematical equation that was used to project the truly round earth onto a flat surface (map).
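
To make these components concrete, here is a minimal sketch using the sf package. It takes one hypothetical point near Freetown in longitude/latitude (EPSG 4326, WGS84) and transforms it to a projected UTM system. The choice of EPSG 32628 (WGS 84 / UTM zone 28N) is an assumption made for illustration, not something prescribed by this page.

```r
library(sf)

# a hypothetical point in longitude/latitude (decimal degrees), WGS84
pt_wgs84 <- st_sfc(st_point(c(-13.23, 8.47)), crs = 4326)

# inspect the CRS: coordinate system, datum, and units are all described here
st_crs(pt_wgs84)
st_is_longlat(pt_wgs84)     # TRUE: coordinates are degrees of longitude/latitude

# transform to a projected CRS whose units are meters (assumed: UTM zone 28N)
pt_utm <- st_transform(pt_wgs84, 32628)
st_is_longlat(pt_utm)       # FALSE: coordinates are now projected
st_coordinates(pt_utm)      # easting and northing in meters
```

Layers need to share a CRS before they are combined, so a transformation like this is a common step when one layer is in degrees and another is in meters.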

Remember that you can summarise spatial data without using the mapping tools shown below. Sometimes a simple table by geography (e.g. district, country, etc.) is all that is needed!

28.3 Getting started with GIS

There are a couple of key items you will need to have and to think about to make a map. These include:

  • A dataset – this can be in a spatial data format (such as shapefiles, as noted above) or it may not be in a spatial format (for instance just as a csv).

  • If your dataset is not in a spatial format you will also need a reference dataset. Reference data consists of the spatial representation of the data and the related attributes, which would include material containing the location and address information of specific features.

    • If you are working with pre-defined geographic boundaries (for example, administrative regions), reference shapefiles are often freely available to download from a government agency or data sharing organization. When in doubt, a good place to start is to Google “[regions] shapefile”

    • If you have address information, but no latitude and longitude, you may need to use a geocoding engine to get the spatial reference data for your records.

  • An idea about how you want to present the information in your datasets to your target audience. There are many different types of maps, and it is important to think about which type of map best fits your needs.

Types of maps for visualizing your data

Choropleth map - a type of thematic map where colors, shading, or patterns are used to represent geographic regions in relation to their value of an attribute. For instance a larger value could be indicated by a darker colour than a smaller value. This type of map is particularly useful when visualizing a variable and how it changes across defined regions or geopolitical areas.

Case density heatmap - a type of thematic map where colours are used to represent intensity of a value, however, it does not use defined regions or geopolitical boundaries to group data. This type of map is typically used for showing ‘hot spots’ or areas with a high density or concentration of points.

Dot density map - a thematic map type that uses dots to represent attribute values in your data. This type of map is best used to visualize the scatter of your data and visually scan for clusters.

Proportional symbols map (graduated symbols map) - a thematic map similar to a choropleth map, but instead of using colour to indicate the value of an attribute it uses a symbol (usually a circle) in relation to the value. For instance a larger value could be indicated by a larger symbol than a smaller value. This type of map is best used when you want to visualize the size or quantity of your data across geographic regions.

You can also combine several different types of visualizations to show complex geographic patterns. For example, the cases (dots) in the map below are colored according to their closest health facility (see legend). The large red circles show health facility catchment areas of a certain radius, and the bright red case-dots those that were outside any catchment range:

Note: The primary focus of this GIS page is based on the context of field outbreak response. Therefore the contents of the page will cover the basic spatial data manipulations, visualizations, and analyses.

28.4 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,           # to import data
  here,          # to locate files
  tidyverse,     # to clean, handle, and plot the data (includes ggplot2 package)
  sf,            # to manage spatial data using a Simple Feature format
  tmap,          # to produce simple maps, works for both interactive and static maps
  janitor,       # to clean column names
  OpenStreetMap, # to add OSM basemap in ggplot map
  spdep          # spatial statistics
  ) 

You can see an overview of all the R packages that deal with spatial data at the CRAN “Spatial Task View”.

Sample case data

For demonstration purposes, we will work with a random sample of 1000 cases from the simulated Ebola epidemic linelist dataframe (computationally, working with fewer cases is easier to display in this handbook). If you want to follow along, click to download the “clean” linelist (as .rds file).

Since we are taking a random sample of the cases, your results may look slightly different from what is demonstrated here when you run the code on your own.

Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import clean case linelist
linelist <- import("linelist_cleaned.rds")  

Next we select a random sample of 1000 rows using sample() from base R.

# generate 1000 random row numbers, from the number of rows in linelist
sample_rows <- sample(nrow(linelist), 1000)

# subset linelist to keep only the sample rows, and all columns
linelist <- linelist[sample_rows,]
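
If you would like the random sample to be the same every time the code is run, one option is to set a seed before sampling. This is a minimal base R sketch on a stand-in data frame (demo_df is hypothetical, and the seed value 123 is arbitrary); the handbook's own results were not necessarily produced this way.

```r
# stand-in for the linelist: a data frame with 5000 rows
demo_df <- data.frame(id = 1:5000)

# setting the same seed before sampling gives the same row numbers each time
set.seed(123)
rows_a <- sample(nrow(demo_df), 1000)

set.seed(123)
rows_b <- sample(nrow(demo_df), 1000)

identical(rows_a, rows_b)   # TRUE

# subset to the sampled rows, keeping all columns
demo_sample <- demo_df[rows_a, , drop = FALSE]
nrow(demo_sample)           # 1000
```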

Now we want to convert this linelist, which is class dataframe, to an object of class “sf” (simple features). Given that the linelist has two columns “lon” and “lat” representing the longitude and latitude of each case’s residence, this will be easy.

We use the package sf (simple features) and its function st_as_sf() to create the new object we call linelist_sf. This new object looks essentially the same as the linelist, but the columns lon and lat have been designated as coordinate columns, and a coordinate reference system (CRS) has been assigned for when the points are displayed. 4326 identifies our coordinates as based on the World Geodetic System 1984 (WGS84) - which is standard for GPS coordinates.

# Create sf object
linelist_sf <- linelist %>%
     sf::st_as_sf(coords = c("lon", "lat"), crs = 4326)
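
To see what this conversion does on a small scale, the sketch below applies st_as_sf() to a hypothetical three-row data frame (toy_cases and its coordinates are made up for illustration).

```r
library(sf)

# three hypothetical cases with longitude and latitude columns
toy_cases <- data.frame(
  case_id = c("a", "b", "c"),
  lon     = c(-13.23, -13.25, -13.21),
  lat     = c(8.47, 8.45, 8.48)
)

# designate lon/lat as coordinates and assign the WGS84 CRS
toy_sf <- st_as_sf(toy_cases, coords = c("lon", "lat"), crs = 4326)

class(toy_sf)    # includes "sf" as well as "data.frame"
names(toy_sf)    # lon and lat are replaced by a geometry column
st_crs(toy_sf)$epsg
```

By default, st_as_sf() stops with an error if any coordinate is missing, so rows without coordinates usually need to be handled (for example, filtered out) before converting.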

This is what the resulting linelist_sf object looks like. In this demonstration, we will only use the column date_onset and geometry (which was constructed from the longitude and latitude fields above and is the last column in the data frame).

DT::datatable(head(linelist_sf, 10), rownames = FALSE, options = list(pageLength = 5, scrollX=T), class = 'white-space: nowrap' )

Admin boundary shapefiles

Sierra Leone: Admin boundary shapefiles

In advance, we have downloaded all administrative boundaries for Sierra Leone from the Humanitarian Data Exchange (HDX) website here. Alternatively, you can download these and all other example data for this handbook via our R package, as explained in the Download handbook and data page.

Now we are going to do the following to save the Admin Level 3 shapefile in R:

  1. Import the shapefile
  2. Clean the column names
  3. Filter rows to keep only areas of interest

To import a shapefile we use the read_sf() function from sf. It is provided the filepath via here(); in our case the file is within our R project in the “data”, “gis”, and “shp” subfolders, with filename “sle_adm3.shp” (see pages on Import and export and R projects for more information). You will need to provide your own file path.

Next we use clean_names() from the janitor package to standardize the column names of the shapefile. We also use filter() to keep only the rows with admin2name of “Western Area Urban” or “Western Area Rural”.

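# ADM3 level raw import (provide your own file path)
sle_adm3_raw <- sf::read_sf(here("data", "gis", "shp", "sle_adm3.shp"))

# ADM3 level clean
sle_adm3 <- sle_adm3_raw %>%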
  clean_names() %>% # standardize column names
  filter(admin2name %in% c("Western Area Urban", "Western Area Rural")) # filter to keep certain areas

Below you can see how the shapefile looks after import and cleaning. Scroll to the right to see how there are columns with admin level 0 (country), admin level 1, admin level 2, and finally admin level 3. Each level has a character name and a unique identifier “pcode”. The pcode expands with each increasing admin level e.g. SL (Sierra Leone) -> SL04 (Western) -> SL0410 (Western Area Rural) -> SL040101 (Koya Rural).

Population data

Sierra Leone: Population by ADM3

These data can again be downloaded from HDX (link here) or via our epirhandbook R package as explained in this page. We use import() to load the .csv file. We also pass the imported file to clean_names() to standardize the column name syntax.

# Population by ADM3
sle_adm3_pop <- import(here("data", "gis", "population", "sle_admpop_adm3_2020.csv")) %>%
  clean_names()

Here is what the population file looks like. Scroll to the right to see how each jurisdiction has columns with male population, female population, total population, and the population break-down in columns by age group.

Health Facilities

Sierra Leone: Health facility data from OpenStreetMap

Again we have downloaded the locations of health facilities from HDX here. You can also get them via the instructions in the Download handbook and data page.

We import the facility points shapefile with read_sf(), again clean the column names, and then filter to keep only the points tagged as either “hospital”, “clinic”, or “doctors”.

# OSM health facility shapefile
sle_hf <- sf::read_sf(here("data", "gis", "shp", "sle_hf.shp")) %>% 
  clean_names() %>%
  filter(amenity %in% c("hospital", "clinic", "doctors"))

Here is the resulting dataframe - scroll right to see the facility name and geometry coordinates.

28.5 Plotting coordinates

The easiest way to plot X-Y coordinates (longitude/latitude points), here the locations of cases, is to draw them as points directly from the linelist_sf object which we created in the preparation section.

The package tmap offers simple mapping capabilities for both static (“plot” mode) and interactive (“view” mode) maps with just a few lines of code. The tmap syntax is similar to that of ggplot2, such that commands are added to each other with +. Read more detail in this vignette.

First, set the tmap mode. In this case we will use “plot” mode, which produces static outputs.

tmap_mode("plot") # choose either "view" or "plot"

Below, the points are plotted alone. tm_shape() is provided with the linelist_sf object. We then add points via tm_dots(), specifying the size and color. Because linelist_sf is an sf object, we have already designated the two columns that contain the lat/long coordinates and the coordinate reference system (CRS):

# Just the cases (points)
tm_shape(linelist_sf) + tm_dots(size=0.08, col='blue')

Alone, the points do not tell us much. So we should also map the administrative boundaries:

Again we use tm_shape() (see documentation) but instead of providing the case points object, we provide the administrative boundary shapefile (polygons).

With the bbox = argument (bbox stands for “bounding box”) we can specify the coordinate boundaries. First we show the map display without bbox, and then with it.

# Just the administrative boundaries (polygons)
tm_shape(sle_adm3) +               # admin boundaries shapefile
  tm_polygons(col = "#F7F7F7")+    # show polygons in light grey
  tm_borders(col = "#000000",      # show borders with color and line weight
             lwd = 2) +
  tm_text("admin3name")            # column text to display for each polygon


# Same as above, but with zoom from bounding box
tm_shape(sle_adm3,
         bbox = c(-13.3, 8.43,    # corner
                  -13.2, 8.5)) +  # corner
  tm_polygons(col = "#F7F7F7") +
  tm_borders(col = "#000000", lwd = 2) +
  tm_text("admin3name")
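
The four numbers given to bbox = are in the order xmin, ymin, xmax, ymax. Rather than typing them by hand, they can also be derived from a spatial object with st_bbox() from sf. The sketch below uses two hypothetical corner points that match the values used above.

```r
library(sf)

# two hypothetical corner points matching the bounding box used above
corners <- st_sfc(
  st_point(c(-13.3, 8.43)),
  st_point(c(-13.2, 8.5)),
  crs = 4326
)

# st_bbox() returns the extent in the order xmin, ymin, xmax, ymax
bb <- st_bbox(corners)
bb
```

Calling st_bbox() on a subset of your own polygons or points is one way to zoom a map to exactly the area they cover.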

And now both points and polygons together:

# All together
tm_shape(sle_adm3, bbox = c(-13.3, 8.43, -13.2, 8.5)) +     # admin boundaries, zoomed with bbox
  tm_polygons(col = "#F7F7F7") +
  tm_borders(col = "#000000", lwd = 2) +
  tm_text("admin3name")+
tm_shape(linelist_sf) +
  tm_dots(size=0.08, col='blue', alpha = 0.5) +
  tm_layout(title = "Distribution of Ebola cases")   # give title to map

To read a good comparison of mapping options in R, see this blog post.

28.6 Spatial joins

You may be familiar with joining data from one dataset to another one. Several methods are discussed in the Joining data page of this handbook. A spatial join serves a similar purpose but leverages spatial relationships. Instead of relying on common values in columns to correctly match observations, you can utilize their spatial relationships, such as one feature being within another, or the nearest neighbor to another, or within a buffer of a certain radius from another, etc.

The sf package offers various methods for spatial joins. See more documentation about the st_join method and spatial join types in this reference.
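
To make the idea concrete before working with the real data, here is a minimal sketch of a spatial join on toy data. The districts, cases, and the helper unit_square() are all hypothetical and use plain planar coordinates with no CRS; only st_join() and st_intersects() are the sf functions discussed in this section.

```r
library(sf)

# Helper (hypothetical): a unit square with lower-left corner at (x0, y0)
unit_square <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}

# Two adjacent toy "districts" (planar coordinates, no CRS set)
districts <- st_sf(district = c("A", "B"),
                   geometry = st_sfc(unit_square(0, 0), unit_square(1, 0)))

# Three toy cases: one in each district, and one outside both
cases <- st_sf(case_id = c("c1", "c2", "c3"),
               geometry = st_sfc(st_point(c(0.5, 0.5)),
                                 st_point(c(1.5, 0.5)),
                                 st_point(c(5, 5))))

# Left spatial join: each case gains the columns of the polygon it falls in
cases_joined <- st_join(cases, districts, join = st_intersects)
cases_joined$district
```

The case that falls outside both polygons is kept, with a missing value for district. This is the same reason a few cases can end up with NA administrative units after a join like this.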

Points in polygon

Spatially assign administrative units to cases

Here is an interesting conundrum: the case linelist does not contain any information about the administrative units of the cases. Although it is ideal to collect such information during the initial data collection phase, we can also assign administrative units to individual cases based on their spatial relationships (i.e. point intersects with a polygon).

Below, we will spatially intersect our case locations (points) with the ADM3 boundaries (polygons):

  1. Begin with the linelist (points)
  2. Spatial join to the boundaries, setting the type of join at “st_intersects”
  3. Use select() to keep only certain of the new administrative boundary columns
linelist_adm <- linelist_sf %>%
  
  # join the administrative boundary file to the linelist, based on spatial intersection
  sf::st_join(sle_adm3, join = st_intersects)

All the columns from sle_adm3 have been added to the linelist! Each case now has columns detailing the administrative levels that it falls within. In this example, we only want to keep two of the new columns (admin level 3), so we select() the old column names and just the two additional columns of interest:

linelist_adm <- linelist_sf %>%
  
  # join the administrative boundary file to the linelist, based on spatial intersection
  sf::st_join(sle_adm3, join = st_intersects) %>% 
  
  # Keep the old column names and two new admin ones of interest
  select(names(linelist_sf), admin3name, admin3pcod)

Below, just for display purposes, you can see the first ten cases and the admin level 3 (ADM3) jurisdictions that have been attached to them, based on where each point spatially intersected with the polygon shapes.

# Now you will see the ADM3 names attached to each case
linelist_adm %>% select(case_id, admin3name, admin3pcod)
## Simple feature collection with 1000 features and 3 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -13.27095 ymin: 8.448085 xmax: -13.20522 ymax: 8.491748
## Geodetic CRS:  WGS 84
## First 10 features:
##      case_id     admin3name admin3pcod                   geometry
## 2401  dab451       West III   SL040208 POINT (-13.26343 8.482226)
## 842   2a438e       West III   SL040208 POINT (-13.26695 8.462632)
## 237   ee1b06       West III   SL040208 POINT (-13.25373 8.462149)
## 2146  dcea0f Mountain Rural   SL040102 POINT (-13.21596 8.461143)
## 4880  174e87 Mountain Rural   SL040102  POINT (-13.2232 8.477856)
## 5642  4be7f2       West III   SL040208 POINT (-13.25631 8.459439)
## 3504  3fad0c         East I   SL040203 POINT (-13.21706 8.488859)
## 3478  6cfedd Mountain Rural   SL040102 POINT (-13.21274 8.465191)
## 1478  069ea8       West III   SL040208 POINT (-13.26745 8.459997)
## 5309  6855bf         East I   SL040203 POINT (-13.21219 8.485044)

Now we can describe our cases by administrative unit - something we were not able to do before the spatial join!

# Make new dataframe containing counts of cases by administrative unit
case_adm3 <- linelist_adm %>%          # begin with linelist with new admin cols
  as_tibble() %>%                      # convert to tibble for better display
  group_by(admin3pcod, admin3name) %>% # group by admin unit, both by name and pcode 
  summarise(cases = n()) %>%           # summarize and count rows
  arrange(desc(cases))                     # arrange in descending order

case_adm3
## # A tibble: 10 x 3
## # Groups:   admin3pcod [10]
##    admin3pcod admin3name     cases
##    <chr>      <chr>          <int>
##  1 SL040102   Mountain Rural   279
##  2 SL040208   West III         218
##  3 SL040207   West II          180
##  4 SL040204   East II          105
##  5 SL040201   Central I         66
##  6 SL040203   East I            63
##  7 SL040206   West I            36
##  8 SL040202   Central II        27
##  9 SL040205   East III          24
## 10 <NA>       <NA>               2

We can also create a bar plot of case counts by administrative unit.

In this example, we begin the ggplot() with the linelist_adm, so that we can apply factor functions like fct_infreq() which orders the bars by frequency (see page on Factors for tips).

ggplot(
    data = linelist_adm,                       # begin with linelist containing admin unit info
    mapping = aes(
      x = fct_rev(fct_infreq(admin3name))))+ # x-axis is admin units, ordered by frequency (reversed)
  geom_bar()+                                # create bars, height is number of rows
  coord_flip()+                              # flip X and Y axes for easier reading of adm units
  theme_classic()+                           # simplify background
  labs(                                      # titles and labels
    x = "Admin level 3",
    y = "Number of cases",
    title = "Number of cases, by administrative unit",
    caption = "As determined by a spatial join, from 1000 randomly sampled cases from linelist"
  )

Nearest neighbor

Finding the nearest health facility / catchment area

It might be useful to know where the health facilities are located in relation to the disease hot spots.

We can use the st_nearest_feature join method with the st_join() function (sf package) to identify the closest health facility to individual cases.

  1. We begin with the shapefile linelist linelist_sf
  2. We spatially join with sle_hf, which is the locations of health facilities and clinics (points)
# Closest health facility to each case
linelist_sf_hf <- linelist_sf %>%                  # begin with linelist shapefile  
  st_join(sle_hf, join = st_nearest_feature) %>%   # data from nearest clinic joined to case data 
  select(case_id, osm_id, name, amenity) %>%       # keep columns of interest, including id, name, type, and geometry of healthcare facility
  rename("nearest_clinic" = "name")                # re-name for clarity

Each case now has data on the nearest clinic/hospital.

Counting the cases by facility shows that “Den Clinic” is the closest health facility for about 35% of the cases.

# Count cases by health facility
hf_catchment <- linelist_sf_hf %>%   # begin with linelist including nearest clinic data
  as.data.frame() %>%                # convert from shapefile to dataframe
  count(nearest_clinic,              # count rows by "name" (of clinic)
        name = "case_n") %>%         # assign new counts column as "case_n"
  arrange(desc(case_n))              # arrange in descending order

hf_catchment                         # print to console
##                          nearest_clinic case_n
## 1                            Den Clinic    355
## 2       Shriners Hospitals for Children    318
## 3         GINER HALL COMMUNITY HOSPITAL    175
## 4                             panasonic     54
## 5 Princess Christian Maternity Hospital     43
## 6                     ARAB EGYPT CLINIC     23
## 7                  MABELL HEALTH CENTER     16
## 8                                  <NA>     16
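
As a self-contained illustration of the nearest-feature logic, the sketch below uses made-up clinics and cases in planar coordinates (no CRS set). It shows both st_nearest_feature() on its own, which returns row indices, and the same function used as the join type in st_join().

```r
library(sf)

# Toy clinics and cases in planar coordinates (hypothetical, no CRS set)
clinics <- st_sf(clinic = c("North", "South"),
                 geometry = st_sfc(st_point(c(0, 10)), st_point(c(0, 0))))
cases <- st_sf(case_id = c("c1", "c2", "c3"),
               geometry = st_sfc(st_point(c(1, 9)),
                                 st_point(c(1, 1)),
                                 st_point(c(0, 2))))

# Row index of the nearest clinic for each case
nearest_idx <- st_nearest_feature(cases, clinics)

# The same logic as a join: attach the nearest clinic's columns to each case
cases_nearest <- st_join(cases, clinics, join = st_nearest_feature)

# Catchment counts per clinic
table(cases_nearest$clinic)
```

Note that a nearest-feature join always finds a match, however far away the nearest facility is, so it says nothing about whether that facility is actually within reach.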

To visualize the results, we can use tmap, this time in interactive mode for easier viewing:

tmap_mode("view")   # set tmap mode to interactive  

# plot the cases and clinic points 
tm_shape(linelist_sf_hf) +            # plot cases
  tm_dots(size=0.08,                  # cases colored by nearest clinic
          col='nearest_clinic') +    
tm_shape(sle_hf) +                    # plot clinic facilities in large black dots
  tm_dots(size=0.3, col='black', alpha = 0.4) +      
  tm_text("name") +                   # overlay with name of facility
tm_view(set.view = c(-13.2284, 8.4699, 13), # adjust zoom (center coords, zoom)
        set.zoom.limits = c(13,14))+
tm_layout(title = "Cases, colored by nearest clinic")

Buffers

We can also explore how many cases are located within 2.5km (~30 mins) walking distance from the closest health facility.

Note: For more accurate distance calculations, it is better to re-project your sf object to the respective local map projection system such as UTM (Earth projected onto a planar surface). In this example, for simplicity we will stick to the World Geodetic System (WGS84) geographic coordinate system (Earth represented as a spherical / round surface, therefore the units are in decimal degrees). We will use a general conversion of: 1 decimal degree = ~111km.

See more information about map projections and coordinate systems at this esri article. This blog talks about different types of map projection and how one can choose a suitable projection depending on the area of interest and the context of your map / analysis.

First, create a circular buffer with a radius of ~2.5km around each health facility. This is done with the function st_buffer() from sf. Because the unit of the map is in lat/long decimal degrees, that is how “0.02” is interpreted. If your map coordinate system is in meters, the number must be provided in meters.

sle_hf_2k <- sle_hf %>%
  st_buffer(dist=0.02)       # decimal degrees translating to approximately 2.5km 
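
For comparison, here is a rough sketch of the more accurate route mentioned in the note: re-project to a metric coordinate system and give the buffer distance in meters. The single facility location is made up, and the choice of EPSG:32628 (UTM zone 28N) is our assumption for this longitude rather than something used elsewhere in this page. Under the 111 km rule of thumb, 0.02 degrees works out to roughly 2.2 km, so the 2.5 km label is approximate.

```r
library(sf)

# Rule-of-thumb conversion: 1 decimal degree is roughly 111 km
2.5 / 111          # decimal degrees corresponding to about 2.5 km

# Alternative: project to a metric CRS and buffer in meters.
# One hypothetical facility in Freetown; EPSG:32628 (UTM zone 28N) is an
# assumption that suits this longitude.
facility <- st_sfc(st_point(c(-13.23, 8.47)), crs = 4326)
facility_utm <- st_transform(facility, 32628)

buffer_utm <- st_buffer(facility_utm, dist = 2500)   # 2500 m radius
st_area(buffer_utm)                                  # close to pi * 2500^2 m^2
```

If you buffer in a projected system, the cases must be transformed to the same CRS with st_transform() before they are joined to the buffers.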

Below we plot the buffer zones themselves, with the health facilities at their centers:

tmap_mode("plot")
# Create circular buffers
tm_shape(sle_hf_2k) +
  tm_borders(col = "black", lwd = 2)+
tm_shape(sle_hf) +                    # plot clinic facilities in large black dots
  tm_dots(size=0.3, col='black')      

Second, we intersect these buffers with the cases (points) using st_join() and the join type of st_intersects. That is, the data from the buffers are joined to the points that they intersect with.

# Intersect the cases with the buffers
linelist_sf_hf_2k <- linelist_sf_hf %>%
  st_join(sle_hf_2k, join = st_intersects, left = TRUE) %>%
  filter(osm_id.x==osm_id.y | is.na(osm_id.y)) %>%
  select(case_id, osm_id.x, nearest_clinic, amenity.x, osm_id.y)

Now we can count the results: cases that did not intersect with any buffer have a missing value in osm_id.y, and so live more than a 30-minute walk from the nearest health facility. The code below counts them, out of 1000 cases in total.

# Cases which did not get intersected with any of the health facility buffers
linelist_sf_hf_2k %>% 
  filter(is.na(osm_id.y)) %>%
  nrow()
## [1] 1000

We can visualize the results such that cases that did not intersect with any buffer appear in red.

tmap_mode("view")

# First display the cases in points
tm_shape(linelist_sf_hf) +
  tm_dots(size=0.08, col='nearest_clinic') +

# plot clinic facilities in large black dots
tm_shape(sle_hf) +                    
  tm_dots(size=0.3, col='black')+   

# Then overlay the health facility buffers in polylines
tm_shape(sle_hf_2k) +
  tm_borders(col = "black", lwd = 2) +

# Highlight cases that are not part of any health facility buffers
# in red dots  
tm_shape(linelist_sf_hf_2k %>%  filter(is.na(osm_id.y))) +
  tm_dots(size=0.1, col='red') +
tm_view(set.view = c(-13.2284,8.4699, 13), set.zoom.limits = c(13,14))+

# add title  
tm_layout(title = "Cases by clinic catchment area")

Other spatial joins

Alternative values for argument join include (from the documentation)

  • st_contains_properly
  • st_contains
  • st_covered_by
  • st_covers
  • st_crosses
  • st_disjoint
  • st_equals_exact
  • st_equals
  • st_is_within_distance
  • st_nearest_feature
  • st_overlaps
  • st_touches
  • st_within
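
As one example of an alternative join type, st_is_within_distance can stand in for the buffer-then-intersect approach shown earlier. The sketch below uses made-up points in planar coordinates (no CRS, arbitrary units), and relies on st_join() passing the extra dist = argument through to the join function.

```r
library(sf)

# Toy data in planar coordinates (hypothetical, no CRS set, arbitrary units)
clinics <- st_sf(clinic = "c1",
                 geometry = st_sfc(st_point(c(1, 0))))
cases <- st_sf(case_id = c("near", "far"),
               geometry = st_sfc(st_point(c(0, 0)),    # 1 unit from the clinic
                                 st_point(c(3, 0))))   # 2 units from the clinic

# Keep all cases; attach clinic columns only where a clinic lies within 1.5 units
cases_within <- st_join(cases, clinics,
                        join = st_is_within_distance, dist = 1.5)
cases_within$clinic
```

With real data the units of dist depend on the coordinate reference system, so it is worth checking the st_is_within_distance documentation before relying on a particular threshold.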

28.7 Choropleth maps

Choropleth maps can be useful to visualize your data by pre-defined area, usually administrative unit or health area. In outbreak response this can help to target resource allocation for specific areas with high incidence rates, for example.

Now that we have the administrative unit names assigned to all cases (see section on spatial joins, above), we can start mapping the case counts by area (choropleth maps).

Since we also have population data by ADM3, we can add this information to the case_adm3 table created previously.

We begin with the dataframe created in the previous step case_adm3, which is a summary table of each administrative unit and its number of cases.

  1. The population data sle_adm3_pop are joined using a left_join() from dplyr on the basis of common values across column admin3pcod in the case_adm3 dataframe, and column adm3_pcode in the sle_adm3_pop dataframe (see the page on Joining data).
  2. select() is applied to the new dataframe, to keep only the useful columns - total is total population
  3. Cases per 10,000 population is calculated as a new column with mutate()
# Add population data and calculate cases per 10K population
case_adm3 <- case_adm3 %>% 
     left_join(sle_adm3_pop,                             # add columns from pop dataset
               by = c("admin3pcod" = "adm3_pcode")) %>%  # join based on common values across these two columns
     select(names(case_adm3), total) %>%                 # keep only important columns, including total population
     mutate(case_10kpop = round(cases/total * 10000, 3)) # make new column with case rate per 10000, rounded to 3 decimals

case_adm3                                                # print to console for viewing
## # A tibble: 10 x 5
## # Groups:   admin3pcod [10]
##    admin3pcod admin3name     cases  total case_10kpop
##    <chr>      <chr>          <int>  <int>       <dbl>
##  1 SL040102   Mountain Rural   279  33993       82.1 
##  2 SL040208   West III         218 210252       10.4 
##  3 SL040207   West II          180 145109       12.4 
##  4 SL040204   East II          105  99821       10.5 
##  5 SL040201   Central I         66  69683        9.47
##  6 SL040203   East I            63  68284        9.23
##  7 SL040206   West I            36  60186        5.98
##  8 SL040202   Central II        27  23874       11.3 
##  9 SL040205   East III          24 500134        0.48
## 10 <NA>       <NA>               2     NA       NA

Join this table with the ADM3 polygons shapefile for mapping:

case_adm3_sf <- case_adm3 %>%                 # begin with cases & rate by admin unit
  left_join(sle_adm3, by="admin3pcod") %>%    # join to shapefile data by common column
  select(objectid, admin3pcod,                # keep only certain columns of interest
         admin3name = admin3name.x,           # clean name of one column
         admin2name, admin1name,
         cases, total, case_10kpop,
         geometry) %>%                        # keep geometry so polygons can be plotted
  st_as_sf()                                  # convert to shapefile

Mapping the results

# tmap mode
tmap_mode("plot")               # view static map

# plot polygons
tm_shape(case_adm3_sf) + 
        tm_polygons("cases") +  # color by number of cases column
        tm_text("admin3name")   # name display

We can also map the incidence rates

# Cases per 10K population
tmap_mode("plot")             # static viewing mode

# plot
tm_shape(case_adm3_sf) +                # plot polygons
  tm_polygons("case_10kpop",            # color by column containing case rate
              breaks=c(0, 10, 50, 100), # define break points for colors
              palette = "Purples"       # use a purple color palette
              ) +
  tm_text("admin3name")                 # display text

28.8 Mapping with ggplot2

If you are already familiar with using ggplot2, you can use that package instead to create static maps of your data. The geom_sf() function will draw different objects based on which features (points, lines, or polygons) are in your data. For example, you can use geom_sf() in a ggplot() using sf data with polygon geometry to create a choropleth map.

To illustrate how this works, we can start with the ADM3 polygons shapefile that we used earlier. Recall that these are Admin Level 3 regions in Sierra Leone:

sle_adm3
## Simple feature collection with 12 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -13.29894 ymin: 8.094272 xmax: -12.91333 ymax: 8.499809
## Geodetic CRS:  WGS 84
## # A tibble: 12 x 20
##    objectid admin3name    admin3pcod admin3ref_n   admin2name     admin2pcod admin1name admin1pcod admin0name  admin0pcod date       valid_on   valid_to   shape_leng
##  *    <dbl> <chr>         <chr>      <chr>         <chr>          <chr>      <chr>      <chr>      <chr>       <chr>      <date>     <date>     <date>          <dbl>
##  1      155 Koya Rural    SL040101   Koya Rural    Western Area ~ SL0401     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.638 
##  2      156 Mountain Rur~ SL040102   Mountain Rur~ Western Area ~ SL0401     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.293 
##  3      157 Waterloo Rur~ SL040103   Waterloo Rur~ Western Area ~ SL0401     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.723 
##  4      158 York Rural    SL040104   York Rural    Western Area ~ SL0401     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             1.24  
##  5      159 Central I     SL040201   Central I     Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.0688
##  6      160 East I        SL040203   East I        Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.0575
##  7      161 East II       SL040204   East II       Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.0840
##  8      162 Central II    SL040202   Central II    Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.0488
##  9      163 West III      SL040208   West III      Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.302 
## 10      164 West I        SL040206   West I        Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.0695
## 11      165 West II       SL040207   West II       Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.149 
## 12      167 East III      SL040205   East III      Western Area ~ SL0402     Western    SL04       Sierra Leo~ SL         2016-08-01 2016-10-17 NA             0.327 
## # ... with 6 more variables: shape_area <dbl>, rowcacode0 <chr>, rowcacode1 <chr>, rowcacode2 <chr>, rowcacode3 <chr>, geometry <MULTIPOLYGON [°]>

We can use the inner_join() function from dplyr to add the data we would like to map to the shapefile object. In this case, we are going to use the case_adm3 data frame that we created earlier to summarize case counts by administrative region; however, we can use this same approach to map any data stored in a data frame.

sle_adm3_dat <- sle_adm3 %>% 
  inner_join(case_adm3, by = "admin3pcod") # inner join = retain only if in both data objects

select(sle_adm3_dat, admin3name.x, cases) # print selected variables to console
## Simple feature collection with 9 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -13.29894 ymin: 8.384533 xmax: -13.12612 ymax: 8.499809
## Geodetic CRS:  WGS 84
## # A tibble: 9 x 3
##   admin3name.x   cases                                                                                geometry
##   <chr>          <int>                                                                      <MULTIPOLYGON [°]>
## 1 Mountain Rural   279 (((-13.21496 8.474341, -13.21479 8.474289, -13.21465 8.474296, -13.21455 8.474298, -...
## 2 Central I         66 (((-13.22646 8.489716, -13.22648 8.48955, -13.22644 8.489513, -13.22663 8.489229, -1...
## 3 East I            63 (((-13.2129 8.494033, -13.21076 8.494026, -13.21013 8.494041, -13.2096 8.494025, -13...
## 4 East II          105 (((-13.22653 8.491883, -13.22647 8.491853, -13.22642 8.49186, -13.22633 8.491814, -1...
## 5 Central II        27 (((-13.23154 8.491768, -13.23141 8.491566, -13.23144 8.49146, -13.23131 8.491294, -1...
## 6 West III         218 (((-13.28529 8.497354, -13.28456 8.496497, -13.28403 8.49621, -13.28338 8.496086, -1...
## 7 West I            36 (((-13.24677 8.493453, -13.24669 8.493285, -13.2464 8.493132, -13.24627 8.493131, -1...
## 8 West II          180 (((-13.25698 8.485518, -13.25685 8.485501, -13.25668 8.485505, -13.25657 8.485504, -...
## 9 East III          24 (((-13.20465 8.485758, -13.20461 8.485698, -13.20449 8.485757, -13.20431 8.485577, -...

To make a column chart of case counts by region, using ggplot2, we could then call geom_col() as follows:

ggplot(data=sle_adm3_dat) +
  geom_col(aes(x=fct_reorder(admin3name.x, cases, .desc=T),   # reorder x axis by descending 'cases'
               y=cases)) +                                  # y axis is number of cases by region
  theme_bw() +
  labs(                                                     # set figure text
    title="Number of cases, by administrative unit",
    x="Admin level 3",
    y="Number of cases"
  ) + 
  guides(x=guide_axis(angle=45))                            # angle x-axis labels 45 degrees to fit better

If we want to use ggplot2 to instead make a choropleth map of case counts, we can use similar syntax to call the geom_sf() function:

ggplot(data=sle_adm3_dat) + 
  geom_sf(aes(fill=cases))    # set fill to vary by case count variable

We can then customize the appearance of our map using grammar that is consistent across ggplot2, for example:

ggplot(data=sle_adm3_dat) +                           
  geom_sf(aes(fill=cases)) +                        
  scale_fill_continuous(high="#54278f", low="#f2f0f7") +    # change color gradient
  theme_bw() +
  labs(title = "Number of cases, by administrative unit",   # set figure text
       subtitle = "Admin level 3"
  )

For R users who are comfortable working with ggplot2, geom_sf() offers a simple and direct implementation that is suitable for basic map visualizations. To learn more, read the geom_sf() vignette or the ggplot2 book.

28.9 Basemaps

OpenStreetMap

Below we describe how to achieve a basemap for a ggplot2 map using OpenStreetMap features. Alternative methods include using ggmap which requires free registration with Google (details).

OpenStreetMap is a collaborative project to create a free editable map of the world. The underlying geolocation data (e.g. locations of cities, roads, natural features, airports, schools, hospitals, etc.) are considered the primary output of the project.

First we load the OpenStreetMap package, from which we will get our basemap.

Then, we create the object map, which we define using the function openmap() from OpenStreetMap package (documentation). We provide the following:

  • upperLeft and lowerRight Two coordinate pairs specifying the limits of the basemap tile
    • In this case we’ve put in the max and min from the linelist rows, so the map will respond dynamically to the data
  • zoom = (if null it is determined automatically)
  • type = which type of basemap - we have listed several possibilities here and the code is currently using the first one ([1]) “osm”
  • mergeTiles = we chose TRUE so the basetiles are all merged into one
# load package
pacman::p_load(OpenStreetMap)

# Fit basemap by range of lat/long coordinates. Choose tile type
map <- openmap(
  upperLeft = c(max(linelist$lat, na.rm=T), min(linelist$lon, na.rm=T)),   # upper-left corner: max latitude, min longitude
  lowerRight = c(min(linelist$lat, na.rm=T), max(linelist$lon, na.rm=T)),  # lower-right corner: min latitude, max longitude
  zoom = NULL,
  type = c("osm", "stamen-toner", "stamen-terrain", "stamen-watercolor", "esri","esri-topo")[1])

If we plot this basemap right now, using autoplot.OpenStreetMap() from OpenStreetMap package, you see that the units on the axes are not latitude/longitude coordinates. It is using a different coordinate system. To correctly display the case residences (which are stored in lat/long), this must be changed.

autoplot.OpenStreetMap(map)

Thus, we want to convert the map to latitude/longitude with the openproj() function from OpenStreetMap package. We provide the basemap map and also provide the Coordinate Reference System (CRS) we want. We do this by providing the “proj.4” character string for the WGS 1984 projection, but you can provide the CRS in other ways as well. (see this page to better understand what a proj.4 string is)

# Projection WGS84
map_latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

Now when we create the plot we see latitude and longitude coordinates along the axes. The coordinate system has been converted. Now our cases will plot correctly if overlaid!

# Plot map. Must use "autoplot" in order to work with ggplot
autoplot.OpenStreetMap(map_latlon)

See the tutorials here and here for more info.

28.10 Contoured density heatmaps

Below we describe how to achieve a contoured density heatmap of cases, over a basemap, beginning with a linelist (one row per case).

  1. Create basemap tile from OpenStreetMap, as described above
  2. Plot the cases from linelist using the latitude and longitude columns
  3. Convert the points to a density heatmap with stat_density_2d() from ggplot2

When we have a basemap with lat/long coordinates, we can plot our cases on top using the lat/long coordinates of their residence.

Building on the function autoplot.OpenStreetMap() to create the basemap, ggplot2 functions will easily add on top, as shown with geom_point() below:

# Plot map. Must be autoplotted to work with ggplot
autoplot.OpenStreetMap(map_latlon)+                 # begin with the basemap
  geom_point(                                       # add xy points from linelist lon and lat columns 
    data = linelist,                                
    aes(x = lon, y = lat),
    size = 1, 
    alpha = 0.5,
    show.legend = FALSE) +                          # drop legend entirely
  labs(x = "Longitude",                             # titles & labels
       y = "Latitude",
       title = "Cumulative cases")

The map above might be difficult to interpret, especially with the points overlapping. So you can instead plot a 2d density map using the ggplot2 function stat_density_2d(). You are still using the linelist lat/lon coordinates, but a 2D kernel density estimation is performed and the results are displayed with contour lines - like a topographical map. Read the full documentation here.

# begin with the basemap
autoplot.OpenStreetMap(map_latlon)+
  
  # add the density plot
  ggplot2::stat_density_2d(
        data = linelist,
        aes(
          x = lon,
          y = lat,
          fill = after_stat(level),    # after_stat() replaces the older ..level.. notation
          alpha = after_stat(level)),
        bins = 10,
        geom = "polygon",
        contour_var = "count",
        show.legend = F) +                          
  
  # specify color scale
  scale_fill_gradient(low = "black", high = "red")+
  
  # labels 
  labs(x = "Longitude",
       y = "Latitude",
       title = "Distribution of cumulative cases")
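
To see what the 2D kernel density estimation produces, the sketch below runs kde2d() from the MASS package directly; as far as we know this is the function that stat_density_2d() relies on internally. The coordinates are simulated around one made-up location, so the numbers say nothing about the real linelist.

```r
library(MASS)   # kde2d(); MASS ships with standard R installations

set.seed(1)

# Simulated case coordinates clustered around one hypothetical location
lon <- rnorm(500, mean = -13.23, sd = 0.01)
lat <- rnorm(500, mean = 8.47,  sd = 0.01)

# 2D kernel density estimate on a 25 x 25 grid
dens <- kde2d(lon, lat, n = 25)

# Grid cell with the highest estimated density
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
c(lon = dens$x[peak[1, "row"]], lat = dens$y[peak[1, "col"]])
```

The result is a grid of density values (dens$z) over longitude (dens$x) and latitude (dens$y). The contour bands in the map are drawn by slicing such a grid into levels, with bins = controlling how many.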

Time series heatmap

The density heatmap above shows cumulative cases. We can examine the outbreak over time and space by faceting the heatmap based on the month of symptom onset, as derived from the linelist.

We begin in the linelist, creating a new column with the Year and Month of onset. The format() function from base R changes how a date is displayed. In this case we want “YYYY-MM”.

# Extract month of onset
linelist <- linelist %>% 
  mutate(date_onset_ym = format(date_onset, "%Y-%m"))

# Examine the values 
table(linelist$date_onset_ym, useNA = "always")
## 
## 2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 2014-11 2014-12 2015-01 2015-02 2015-03 2015-04    <NA> 
##       1      14      15      51      73     162     192     133     103      78      47      42      35      54
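
A small stand-alone illustration of what format() does here, using made-up dates:

```r
# Hypothetical onset dates, including one missing value
onset <- as.Date(c("2014-05-03", "2015-01-20", NA))

# Returns character values in "YYYY-MM" form; a missing date stays missing
onset_ym <- format(onset, "%Y-%m")
onset_ym
```

Because the result is character rather than date class, it sorts alphabetically. With the year first and a zero-padded month, that order is also chronological, which is convenient when the column is used to order panels.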

Now, we simply introduce faceting via ggplot2 to the density heatmap. facet_wrap() is applied, using the new column to define the panels. We set the number of facet columns to 4 for clarity.

# packages
pacman::p_load(OpenStreetMap, tidyverse)

# begin with the basemap
autoplot.OpenStreetMap(map_latlon)+
  
  # add the density plot
  ggplot2::stat_density_2d(
        data = linelist,
        aes(
          x = lon,
          y = lat,
          fill = after_stat(level),    # after_stat() replaces the older ..level.. notation
          alpha = after_stat(level)),
        bins = 10,
        geom = "polygon",
        contour_var = "count",
        show.legend = F) +                          
  
  # specify color scale
  scale_fill_gradient(low = "black", high = "red")+
  
  # labels 
  labs(x = "Longitude",
       y = "Latitude",
       title = "Distribution of cumulative cases over time")+
  
  # facet the plot by month-year of onset
  facet_wrap(~ date_onset_ym, ncol = 4)               

28.11 Spatial statistics

Most of our discussion so far has focused on visualization of spatial data. In some cases, you may also be interested in using spatial statistics to quantify the spatial relationships of attributes in your data. This section will provide a very brief overview of some key concepts in spatial statistics, and suggest some resources that will be helpful to explore if you wish to do more comprehensive spatial analyses.

Spatial relationships

Before we can calculate any spatial statistics, we need to specify the relationships between features in our data. There are many ways to conceptualize spatial relationships, but a simple and commonly-applicable model to use is that of adjacency - specifically, that we expect a geographic relationship between areas that share a border or “neighbour” one another.

We can quantify adjacency relationships between administrative region polygons in the sle_adm3 data we have been using with the spdep package. We will specify queen contiguity, which means that regions will be neighbors if they share at least one point along their borders. The alternative would be rook contiguity, which requires that regions share an edge - in our case, with irregular polygons, the distinction is trivial, but in some cases the choice between queen and rook can be influential.

sle_nb <- spdep::poly2nb(sle_adm3_dat, queen=T) # create neighbors 
sle_adjmat <- spdep::nb2mat(sle_nb)    # create matrix summarizing neighbor relationships
sle_listw <- spdep::nb2listw(sle_nb)   # create listw (list of weights) object -- we will need this later

sle_nb
## Neighbour list object:
## Number of regions: 9 
## Number of nonzero links: 30 
## Percentage nonzero weights: 37.03704 
## Average number of links: 3.333333
round(sle_adjmat, digits = 2)
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## 1 0.00 0.20 0.00 0.20 0.00  0.2 0.00 0.20 0.20
## 2 0.25 0.00 0.00 0.25 0.25  0.0 0.00 0.25 0.00
## 3 0.00 0.00 0.00 0.50 0.00  0.0 0.00 0.00 0.50
## 4 0.25 0.25 0.25 0.00 0.00  0.0 0.00 0.00 0.25
## 5 0.00 0.33 0.00 0.00 0.00  0.0 0.33 0.33 0.00
## 6 0.50 0.00 0.00 0.00 0.00  0.0 0.00 0.50 0.00
## 7 0.00 0.00 0.00 0.00 0.50  0.0 0.00 0.50 0.00
## 8 0.20 0.20 0.00 0.00 0.20  0.2 0.20 0.00 0.00
## 9 0.33 0.00 0.33 0.33 0.00  0.0 0.00 0.00 0.00
## attr(,"call")
## spdep::nb2mat(neighbours = sle_nb)

The matrix printed above shows the relationships between the 9 regions in our sle_adm3 data. A score of 0 indicates two regions are not neighbors, while any value other than 0 indicates a neighbor relationship. The values in the matrix are scaled so that each region has a total row weight of 1.
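
The difference between rook and queen contiguity, and the row scaling of the weights, can be reproduced by hand on a regular grid. The sketch below uses only base R and a hypothetical 3 x 3 grid of square cells; it is meant to show the logic, not to replace poly2nb() and nb2mat().

```r
# 3 x 3 grid of square cells, numbered row by row (a hypothetical layout)
cells <- expand.grid(col = 1:3, row = 1:3)
dx <- abs(outer(cells$col, cells$col, "-"))   # horizontal distance between cells
dy <- abs(outer(cells$row, cells$row, "-"))   # vertical distance between cells

rook  <- (dx + dy == 1) * 1          # neighbors share an edge
queen <- (pmax(dx, dy) == 1) * 1     # neighbors share an edge or a corner

rowSums(rook)    # the centre cell (5) has 4 rook neighbors
rowSums(queen)   # the centre cell (5) has 8 queen neighbors

# Row-standardise: divide each row by its number of neighbors
w_queen <- queen / rowSums(queen)
rowSums(w_queen)                     # every row now sums to 1
```

On a regular grid the two definitions differ at every corner contact, which is why the choice can matter. With irregular administrative polygons, contacts at a single point are rare.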

A better way to visualize these neighbor relationships is by plotting them:

plot(sle_adm3_dat$geometry) +                                           # plot region boundaries
  spdep::plot.nb(sle_nb,as(sle_adm3_dat, 'Spatial'), col='grey', add=T) # add neighbor relationships

We have used an adjacency approach to identify neighboring polygons; the neighbors we identified are also sometimes called contiguity-based neighbors. But this is just one way of choosing which regions are expected to have a geographic relationship. The most common alternative approaches for identifying geographic relationships generate distance-based neighbors; briefly, these are:

  • K-nearest neighbors - Based on the distance between centroids (the geographically-weighted center of each polygon region), select the k closest regions as neighbors. A maximum-distance proximity threshold may also be specified. In spdep, you can use knearneigh() (see documentation).

  • Distance threshold neighbors - Select all neighbors within a distance threshold. In spdep, these neighbor relationships can be identified using dnearneigh() (see documentation).
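As a rough sketch of both distance-based approaches, the example below uses hypothetical centroid coordinates on a 3 x 3 grid (toy_coords is invented for illustration). With real polygons you would first derive centroids, for example with sf::st_centroid(). Note that knearneigh() returns a knn object, which knn2nb() converts into a neighbor list.

```r
library(spdep)

# Hypothetical centroid coordinates for 9 regions on a regular 3 x 3 grid
toy_coords <- as.matrix(expand.grid(x = c(1, 2, 3), y = c(1, 2, 3)))

# Distance threshold neighbors: all regions within 1.1 units of each other
toy_dist_nb <- dnearneigh(toy_coords, d1 = 0, d2 = 1.1)

# K-nearest neighbors: coordinates are jittered slightly to avoid tied distances
set.seed(42)
toy_coords_jit <- toy_coords + runif(18, -0.2, 0.2)
toy_knn_nb <- knn2nb(knearneigh(toy_coords_jit, k = 2))

card(toy_dist_nb)   # number of neighbors per region under the distance rule
card(toy_knn_nb)    # number of neighbors per region under the k-nearest rule
```

Under the distance rule the number of neighbors varies by region, whereas the k-nearest rule gives every region the same number of neighbors.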

Spatial autocorrelation

Tobler’s oft-cited first law of geography states that “everything is related to everything else, but near things are more related than distant things.” In epidemiology, this often means that risk of a particular health outcome in a given region is more similar to its neighboring regions than to those far away. This concept has been formalized as spatial autocorrelation - the statistical property that geographic features with similar values are clustered together in space. Statistical measures of spatial autocorrelation can be used to quantify the extent of spatial clustering in your data, locate where clustering occurs, and identify shared patterns of spatial autocorrelation between distinct variables in your data. This section gives an overview of some common measures of spatial autocorrelation and how to calculate them in R.

Moran’s I - This is a global summary statistic of the correlation between the value of a variable in one region, and the values of the same variable in neighboring regions. The Moran’s I statistic typically ranges from -1 to 1. A value of 0 indicates no pattern of spatial correlation, while values closer to 1 or -1 indicate stronger spatial autocorrelation (similar values close together) or spatial dispersion (dissimilar values close together), respectively.

For an example, we will calculate a Moran’s I statistic to quantify the spatial autocorrelation in Ebola cases we mapped earlier (remember, this is a subset of cases from the simulated epidemic linelist dataframe). The spdep package has a function, moran.test, that can do this calculation for us:

moran_i <-spdep::moran.test(sle_adm3_dat$cases,    # numeric vector with variable of interest
                            listw=sle_listw)       # listw object summarizing neighbor relationships

moran_i                                            # print results of Moran's I test
## 
##  Moran I test under randomisation
## 
## data:  sle_adm3_dat$cases  
## weights: sle_listw    
## 
## Moran I statistic standard deviate = 1.5943, p-value = 0.05543
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.20612239       -0.12500000        0.04313463

The output from the moran.test() function shows us a Moran I statistic of 0.21. This suggests some positive spatial autocorrelation in our data - specifically, that regions with similar numbers of Ebola cases tend to be close together - although with a p-value of 0.055 the evidence is borderline at the conventional 0.05 level. The p-value provided by moran.test() is generated by comparison to the expectation under the null hypothesis of no spatial autocorrelation, and can be used if you need to report the results of a formal hypothesis test.
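To see how the statistic and its expectation are formed, here is a minimal base R sketch that computes Moran's I by hand for a hypothetical four-region example (the weights matrix w and the values x are invented for illustration). Under the null hypothesis the expected value is -1/(n-1), which for the 9 regions above gives -0.125, matching the Expectation shown in the moran.test() output.

```r
# Hypothetical example: 4 regions in a line, with row-standardized weights
w <- matrix(
  c(0,   1,   0,   0,
    0.5, 0,   0.5, 0,
    0,   0.5, 0,   0.5,
    0,   0,   1,   0),
  nrow = 4, byrow = TRUE)
x <- c(1, 2, 3, 4)                 # hypothetical case counts

n  <- length(x)                    # number of regions
z  <- x - mean(x)                  # deviations from the mean
s0 <- sum(w)                       # sum of all weights

# Weighted cross-products of neighboring deviations, scaled by total variation
morans_i   <- (n / s0) * sum(w * outer(z, z)) / sum(z^2)
expected_i <- -1 / (n - 1)         # expectation under no spatial autocorrelation

morans_i
expected_i
```

Because the values increase steadily along the line of regions, neighbors have similar values and the statistic is positive.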

Local Moran’s I - We can decompose the (global) Moran’s I statistic calculated above to identify localized spatial autocorrelation; that is, to identify specific clusters in our data. This statistic, which is sometimes called a Local Indicator of Spatial Association (LISA) statistic, summarizes the extent of spatial autocorrelation around each individual region. It can be useful for finding “hot” and “cold” spots on the map.

To show an example, we can calculate and map Local Moran’s I for the Ebola case counts used above, with the localmoran() function from spdep:

# calculate local Moran's I
local_moran <- spdep::localmoran(                  
  sle_adm3_dat$cases,                              # variable of interest
  listw=sle_listw                                  # listw object with neighbor weights
)

# join results to sf data
sle_adm3_dat<- cbind(sle_adm3_dat, local_moran)    

# plot map
ggplot(data=sle_adm3_dat) +
  geom_sf(aes(fill=Ii)) +
  theme_bw() +
  scale_fill_gradient2(low="#2c7bb6", mid="#ffffbf", high="#d7191c",
                       name="Local Moran's I") +
  labs(title="Local Moran's I statistic for Ebola cases",
       subtitle="Admin level 3 regions, Sierra Leone")
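The map above shows only the Ii column, but localmoran() returns several columns per region. The sketch below, on hypothetical grid data (toy_cases is invented for illustration), inspects them and flags regions with a small p-value. The name of the p-value column differs between spdep versions, so it is selected by position here, on the assumption that it is the fifth column.

```r
library(spdep)

toy_nb    <- cell2nb(3, 3)                      # rook neighbors on a 3 x 3 grid
toy_listw <- nb2listw(toy_nb, style = "W")      # row-standardized weights
toy_cases <- c(10, 12, 11, 9, 30, 8, 2, 3, 1)   # hypothetical case counts

toy_local <- localmoran(toy_cases, listw = toy_listw)

colnames(toy_local)        # statistic, expectation, variance, z-score, p-value
toy_p <- toy_local[, 5]    # p-value column, selected by position (assumed fifth)
which(toy_p < 0.05)        # regions with nominally significant local autocorrelation, if any
```

These p-values are not adjusted for multiple testing, so treat flagged regions as candidates for closer inspection rather than confirmed clusters.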

Getis-Ord Gi* - This is another statistic that is commonly used for hotspot analysis; in large part, the popularity of this statistic relates to its use in the Hot Spot Analysis tool in ArcGIS. It is based on the assumption that typically, the difference in a variable’s value between neighboring regions should follow a normal distribution. It uses a z-score approach to identify regions that have significantly higher (hot spot) or significantly lower (cold spot) values of a specified variable, compared to their neighbors.

We can calculate and map the Gi* statistic using the localG() function from spdep:

# Perform local G analysis
getis_ord <- spdep::localG(
  sle_adm3_dat$cases,
  sle_listw
)

# join results to sf data
sle_adm3_dat$getis_ord <- getis_ord

# plot map
ggplot(data=sle_adm3_dat) +
  geom_sf(aes(fill=getis_ord)) +
  theme_bw() +
  scale_fill_gradient2(low="#2c7bb6", mid="#ffffbf", high="#d7191c",
                       name="Gi*") +
  labs(title="Getis-Ord Gi* statistic for Ebola cases",
       subtitle="Admin level 3 regions, Sierra Leone")

As you can see, the map of Getis-Ord Gi* looks slightly different from the map of Local Moran’s I produced earlier. This reflects the fact that the methods used to calculate these two statistics are slightly different; which one you should use depends on your specific use case and the research question of interest.
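Because localG() returns z-scores, one common way to present the results is to classify regions as hot spots, cold spots, or neither. The sketch below uses hypothetical z-scores (toy_z stands in for as.numeric(getis_ord)) and the conventional cut-off of 1.96, which is an assumption you may wish to change and which does not adjust for multiple testing.

```r
# Hypothetical z-scores, standing in for as.numeric(getis_ord)
toy_z <- c(-2.4, -0.3, 0.8, 2.1, 1.2, -1.97)

# Classify each region using a two-sided 5% cut-off
toy_class <- cut(
  toy_z,
  breaks = c(-Inf, -1.96, 1.96, Inf),
  labels = c("Cold spot", "Not significant", "Hot spot"))

table(toy_class)   # count of regions in each category
```

The resulting factor could be added as a column of the sf data and mapped with a discrete fill scale instead of the continuous gradient used above.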

Lee’s L test - This is a statistical test for bivariate spatial correlation. It allows you to test whether the spatial pattern for a given variable x is similar to the spatial pattern of another variable, y, that is hypothesized to be related spatially to x.

To give an example, let’s test whether the spatial pattern of Ebola cases from the simulated epidemic is correlated with the spatial pattern of population. To start, we need to have a population variable in our sle_adm3 data. We can use the total variable from the sle_adm3_pop dataframe that we loaded earlier.

sle_adm3_dat <- sle_adm3_dat %>% 
  rename(population = total)                          # rename 'total' to 'population'

We can quickly visualize the spatial patterns of the two variables side by side, to see whether they look similar:

tmap_mode("plot")

cases_map <- tm_shape(sle_adm3_dat) + tm_polygons("cases") + tm_layout(main.title="Cases")
pop_map <- tm_shape(sle_adm3_dat) + tm_polygons("population") + tm_layout(main.title="Population")

tmap_arrange(cases_map, pop_map, ncol=2)   # arrange into 2x1 facets

Visually, the patterns seem dissimilar. We can use the lee.test() function in spdep to test statistically whether the pattern of spatial autocorrelation in the two variables is related. The L statistic will be close to 0 if there is no correlation between the patterns, close to 1 if there is a strong positive correlation (i.e. the patterns are similar), and close to -1 if there is a strong negative correlation (i.e. the patterns are inverse).

lee_test <- spdep::lee.test(
  x=sle_adm3_dat$cases,          # variable 1 to compare
  y=sle_adm3_dat$population,     # variable 2 to compare
  listw=sle_listw                # listw object with neighbor weights
)

lee_test
## 
##  Lee's L statistic randomisation
## 
## data:  sle_adm3_dat$cases ,  sle_adm3_dat$population 
## weights: sle_listw  
## 
## Lee's L statistic standard deviate = -0.99022, p-value = 0.839
## alternative hypothesis: greater
## sample estimates:
## Lee's L statistic       Expectation          Variance 
##       -0.15362746       -0.04364708        0.01233582

The output above shows that the Lee’s L statistic for our two variables was -0.15, which indicates weak negative correlation. With a p-value of 0.84, there is no evidence of positive spatial correlation between the two variables. This is consistent with our visual assessment that the patterns of cases and population are not related to one another, and suggests that the spatial pattern of cases is not strictly a result of population density in high-risk areas.

The Lee L statistic can be useful for making these kinds of inferences about the relationship between spatially distributed variables; however, to describe the nature of the relationship between two variables in more detail, or adjust for confounding, spatial regression techniques will be needed. These are described briefly in the following section.

Spatial regression

You may wish to make statistical inferences about the relationships between variables in your spatial data. In these cases, it is useful to consider spatial regression techniques - that is, approaches to regression that explicitly consider the spatial organization of units in your data. Some reasons that you may need to consider spatial regression models, rather than standard regression models such as GLMs, include:

  • Standard regression models assume that residuals are independent from one another. In the presence of strong spatial autocorrelation, the residuals of a standard regression model are likely to be spatially autocorrelated as well, thus violating this assumption. This can lead to problems with interpreting the model results, in which case a spatial model would be preferred.

  • Regression models also typically assume that the effect of a variable x is constant over all observations. In the case of spatial heterogeneity, the effects we wish to estimate may vary over space, and we may be interested in quantifying those differences. In this case, spatial regression models offer more flexibility for estimating and interpreting effects.

The details of spatial regression approaches are beyond the scope of this handbook. This section will instead provide an overview of the most common spatial regression models and their uses, and refer you to references that may be of use if you wish to explore this area further.

Spatial error models - These models assume that the error terms across spatial units are correlated, in which case the data would violate the assumptions of a standard OLS model. Spatial error models are also sometimes referred to as simultaneous autoregressive (SAR) models. They can be fit using the errorsarlm() function in the spatialreg package (spatial regression functions which used to be a part of spdep).

Spatial lag models - These models assume that the dependent variable for a region i is influenced not only by the values of independent variables in i, but also by the values of the dependent variable in regions neighboring i. Like spatial error models, spatial lag models are also sometimes described as simultaneous autoregressive (SAR) models. They can be fit using the lagsarlm() function in the spatialreg package.
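As a rough sketch of how these two models might be fit, the example below simulates hypothetical data on a 5 x 5 grid (toy_dat, exposure and rate are invented for illustration, and the simulated data have no built-in spatial structure). It first checks the residuals of an ordinary linear model for spatial autocorrelation with lm.morantest() from spdep, then fits a spatial error and a spatial lag model.

```r
library(spdep)
library(spatialreg)

# Hypothetical data on a 5 x 5 grid of regions
set.seed(123)
toy_nb    <- cell2nb(5, 5)
toy_listw <- nb2listw(toy_nb, style = "W")
toy_dat   <- data.frame(exposure = rnorm(25))
toy_dat$rate <- 2 + 0.5 * toy_dat$exposure + rnorm(25)

# Standard linear model, then a Moran test on its residuals
ols_fit <- lm(rate ~ exposure, data = toy_dat)
lm.morantest(ols_fit, toy_listw)

# Spatial error and spatial lag models
error_fit <- errorsarlm(rate ~ exposure, data = toy_dat, listw = toy_listw)
lag_fit   <- lagsarlm(rate ~ exposure, data = toy_dat, listw = toy_listw)

summary(error_fit)   # lambda: spatial autocorrelation in the error term
summary(lag_fit)     # rho: coefficient on the spatially lagged outcome
```

With your own data, the rows of the data frame must be in the same order as the regions in the listw object.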

The spdep package contains several useful diagnostic tests for deciding between standard OLS, spatial lag, and spatial error models. These tests, called Lagrange Multiplier diagnostics, can be used to identify the type of spatial dependence in your data and choose which model is most appropriate. The function lm.LMtests() can be used to calculate all of the Lagrange Multiplier tests. Anselin (1988) also provides a useful flow chart tool to decide which spatial regression model to use based on the results of the Lagrange Multiplier tests.

Bayesian hierarchical models - Bayesian approaches are commonly used for some applications in spatial analysis, most commonly for disease mapping. They are preferred in cases where case data are sparsely distributed (for example, in the case of a rare outcome) or statistically “noisy”, as they can be used to generate “smoothed” estimates of disease risk by accounting for the underlying latent spatial process. This may improve the quality of estimates. They also allow investigator pre-specification (via choice of prior) of complex spatial correlation patterns that may exist in the data, which can account for spatially-dependent and -independent variation in both independent and dependent variables. In R, Bayesian hierarchical models can be fit using the CARbayes package (see vignette) or R-INLA (see website and textbook). R can also be used to call external software that does Bayesian estimation, such as JAGS or WinBUGS.

28.12 Resources

Data Visualization

29 Tables for presentation

This page demonstrates how to convert summary data frames into presentation-ready tables with the flextable package. These tables can be inserted into powerpoint slides, HTML pages, PDF or Word documents, etc.

Understand that before using flextable, you must create the summary table as a data frame. Use methods from the Descriptive tables and Pivoting data pages such as tabulations, cross-tabulations, pivoting, and calculating descriptive statistics. The resulting data frame can then be passed to flextable for display formatting.

There are many other R packages that can be used to craft tables for presentation - we chose to highlight flextable in this page. An example using the knitr package and its kable() function can be found in the Contact Tracing page. Likewise, the DT package is highlighted in the page Dashboards with Shiny. Others such as gt and huxtable are mentioned in the Suggested packages page.

29.1 Preparation

Load packages

Install and load flextable. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,            # import/export
  here,           # file pathways
  flextable,      # make HTML tables 
  officer,        # helper functions for tables
  tidyverse)      # data management, summary, and visualization

Import data

To begin, we import the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")


Prepare table

Before beginning to use flextable you will need to create your table as a data frame. See the page on Descriptive tables and Pivoting data to learn how to create a data frame using packages such as janitor and dplyr. You must arrange the content in rows and columns as you want it displayed. Then, the data frame will be passed to flextable to display it with colors, headers, fonts, etc.

Below is an example from the Descriptive tables page of converting the case linelist into a data frame that summarises patient outcomes and CT values by hospital, with a Totals row at the bottom. The output is saved as table.

table <- linelist %>% 
  
  # Get summary values per hospital-outcome group
  ###############################################
  group_by(hospital, outcome) %>%                      # Group data
  summarise(                                           # Create new summary columns of indicators of interest
    N = n(),                                            # Number of rows per hospital-outcome group     
    ct_value = median(ct_blood, na.rm=T)) %>%           # median CT value per group
  
  # add totals
  ############
  bind_rows(                                           # Bind the previous table with this mini-table of totals
    linelist %>% 
      filter(!is.na(outcome) & hospital != "Missing") %>%
      group_by(outcome) %>%                            # Grouped only by outcome, not by hospital    
      summarise(
        N = n(),                                       # Number of rows for whole dataset     
        ct_value = median(ct_blood, na.rm=T))) %>%     # Median CT for whole dataset
  
  # Pivot wider and format
  ########################
  mutate(hospital = replace_na(hospital, "Total")) %>% 
  pivot_wider(                                         # Pivot from long to wide
    values_from = c(ct_value, N),                       # new values are from ct and count columns
    names_from = outcome) %>%                           # new column names are from outcomes
  mutate(                                              # Add new columns
    N_Known = N_Death + N_Recover,                               # number with known outcome
    Pct_Death = scales::percent(N_Death / N_Known, 0.1),         # percent cases who died (to 1 decimal)
    Pct_Recover = scales::percent(N_Recover / N_Known, 0.1)) %>% # percent who recovered (to 1 decimal)
  select(                                              # Re-order columns
    hospital, N_Known,                                   # Intro columns
    N_Recover, Pct_Recover, ct_value_Recover,            # Recovered columns
    N_Death, Pct_Death, ct_value_Death)  %>%             # Death columns
  arrange(N_Known)                                    # Arrange rows from lowest to highest (Total row at bottom)

table  # print
## # A tibble: 7 x 8
## # Groups:   hospital [7]
##   hospital                             N_Known N_Recover Pct_Recover ct_value_Recover N_Death Pct_Death ct_value_Death
##   <chr>                                  <int>     <int> <chr>                  <dbl>   <int> <chr>              <dbl>
## 1 St. Mark's Maternity Hospital (SMMH)     325       126 38.8%                     22     199 61.2%                 22
## 2 Central Hospital                         358       165 46.1%                     22     193 53.9%                 22
## 3 Other                                    685       290 42.3%                     21     395 57.7%                 22
## 4 Military Hospital                        708       309 43.6%                     22     399 56.4%                 21
## 5 Missing                                 1125       514 45.7%                     21     611 54.3%                 21
## 6 Port Hospital                           1364       579 42.4%                     21     785 57.6%                 22
## 7 Total                                   3440      1469 42.7%                     22    1971 57.3%                 22

29.2 Basic flextable

Create a flextable

To create and manage flextable objects, we first pass the data frame through the flextable() function. We save the result as my_table.

my_table <- flextable(table) 
my_table

After doing this, we can progressively pipe the my_table object through more flextable formatting functions.

In this page, for the sake of clarity, we will save the table at intermediate steps as my_table, adding flextable functions bit-by-bit. If you want to see all the code from beginning to end written in one chunk, visit the All code together section below.

The general syntax of each line of flextable code is as follows:

  • function(table, i = X, j = X, part = "X"), where:
    • The ‘function’ can be one of many different functions, such as width() to determine column widths, bg() to set background colours, align() to set whether text is centre/right/left aligned, and so on.
    • table = is the name of the table object, although it does not need to be stated if the object is piped into the function.
    • part = refers to which part of the table the function is being applied to. E.g. “header”, “body” or “all”.
    • i = specifies the row to apply the function to, where ‘X’ is the row number. If multiple rows, e.g. the first to third rows, one can specify: i = c(1:3). Note if ‘body’ is selected, the first row starts from underneath the header section.
    • j = specifies the column to apply the function to, where ‘X’ is the column number or name. If multiple columns, e.g. the fifth and sixth, one can specify: j = c(5,6).
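The general syntax can be sketched on a small hypothetical data frame (toy_df is invented for illustration). Here the table is passed explicitly as the first argument of each function rather than piped, to show where it sits in the call.

```r
library(flextable)

# Hypothetical summary data frame
toy_df <- data.frame(
  hospital = c("A", "B", "C"),
  cases    = c(10, 25, 7))

toy_ft <- flextable(toy_df)                          # convert the data frame to a flextable

toy_ft <- bold(toy_ft, i = 1, part = "header")       # bold row 1 of the header
toy_ft <- bg(toy_ft, i = 2:3, j = "cases",           # shade rows 2-3 of the 'cases' column
             bg = "gray90", part = "body")
toy_ft <- flextable::align(toy_ft, j = 2,            # center column 2 in header and body
                           align = "center", part = "all")

toy_ft
```

Columns can be referenced by position (j = 2) or by name (j = "cases"); names tend to be more robust if the column order changes later.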

You can find the complete list of flextable formatting functions here or review the documentation by entering ?flextable.

Column width

We can use the autofit() function, which nicely stretches out the table so that each cell only has one row of text. The function qflextable() is a convenient shorthand for flextable() and autofit().

my_table %>% autofit()

However, this might not always be appropriate, especially if there are very long values within cells, meaning the table might not fit on the page.

Instead, we can specify widths with the width() function. It can take some playing around to know what width value to put. In the example below, we specify different widths for column 1, column 2, and columns 4, 5, 7, and 8.

my_table <- my_table %>% 
  width(j=1, width = 2.7) %>% 
  width(j=2, width = 1.5) %>% 
  width(j=c(4,5,7,8), width = 1)

my_table
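When experimenting with widths, it can help to check the resulting dimensions rather than judging by eye. A minimal sketch on a hypothetical table (toy_df is invented for illustration), using dim() on the flextable to retrieve the column widths in inches:

```r
library(flextable)

# Hypothetical summary data frame
toy_df <- data.frame(hospital = c("A", "B"), cases = c(10, 25), deaths = c(2, 5))

toy_ft <- flextable(toy_df)
toy_ft <- width(toy_ft, j = 1, width = 2.7)        # first column: 2.7 inches
toy_ft <- width(toy_ft, j = c(2, 3), width = 1)    # columns 2 and 3: 1 inch each

dim(toy_ft)$widths        # width of each column, in inches
sum(dim(toy_ft)$widths)   # total table width, to compare against the page width
```

Comparing the total against the usable width of your page or slide gives a quick indication of whether the table will fit.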

Column headers

We want clearer headers for easier interpretation of the table contents.

For this table, we will want to add a second header layer so that columns covering the same subgroups can be grouped together. We do this with the add_header_row() function with top = TRUE. We provide the new name of each column to values =, leaving empty values "" for columns we know we will merge together later.

We also rename the header names in the now-second header in a separate set_header_labels() command.

Finally, to “combine” certain column headers in the top header we use merge_at() to merge the column headers in the top header row.

my_table <- my_table %>% 
  
  add_header_row(
    top = TRUE,                # New header goes on top of existing header row
    values = c("Hospital",     # Header values for each column below
               "Total cases with known outcome", 
               "Recovered",    # This will be the top-level header for this and two next columns
               "",
               "",
               "Died",         # This will be the top-level header for this and two next columns
               "",             # Leave blank, as it will be merged with "Died"
               "")) %>% 
    
  set_header_labels(         # Rename the columns in original header row
      hospital = "", 
      N_Known = "",                  
      N_Recover = "Total",
      Pct_Recover = "% of cases",
      ct_value_Recover = "Median CT values",
      N_Death = "Total",
      Pct_Death = "% of cases",
      ct_value_Death = "Median CT values")  %>% 
  
  merge_at(i = 1, j = 3:5, part = "header") %>% # Horizontally merge columns 3 to 5 in new header row
  merge_at(i = 1, j = 6:8, part = "header")     # Horizontally merge columns 6 to 8 in new header row

my_table  # print

Borders and background

You can adjust the borders, internal lines, etc. with various flextable functions. It is often easier to start by removing all existing borders with border_remove().

Then, you can apply default border themes by passing the table to theme_box(), theme_booktabs(), or theme_alafoli().

You can add vertical and horizontal lines with a variety of functions. hline() and vline() add lines to a specified row or column, respectively. Within each, you must specify the part = as either “all”, “body”, or “header”. For vertical lines, specify the column to j =, and for horizontal lines the row to i =. Other functions like vline_right(), vline_left(), hline_top(), and hline_bottom() add lines to the outsides only.

In all of these functions, the actual line style itself must be specified to border = and must be the output of a separate command using the fp_border() function from the officer package. This function helps you define the width and color of the line. You can define this above the table commands, as shown below.

# define style for border line
border_style = officer::fp_border(color="black", width=1)

# add border lines to table
my_table <- my_table %>% 

  # Remove all existing borders
  border_remove() %>%  
  
  # add horizontal lines via a pre-determined theme setting
  theme_booktabs() %>% 
  
  # add vertical lines to separate Recovered and Died sections
  vline(part = "all", j = 2, border = border_style) %>%   # at column 2 
  vline(part = "all", j = 5, border = border_style)       # at column 5

my_table

Font and alignment

We centre-align all columns aside from the left-most column with the hospital names, using the align() function from flextable.

my_table <- my_table %>% 
   flextable::align(align = "center", j = c(2:8), part = "all") 
my_table

Additionally, we can increase the header font size and change the headers to bold. We can also change the total row to bold.

my_table <-  my_table %>%  
  fontsize(i = 1, size = 12, part = "header") %>%   # adjust font size of header
  bold(i = 1, bold = TRUE, part = "header") %>%     # adjust bold face of header
  bold(i = 7, bold = TRUE, part = "body")           # adjust bold face of total row (row 7 of body)

my_table

We can ensure that the proportion columns display only one decimal place using the function colformat_num(). Note this could also have been done at data management stage with the round() function.

my_table <- colformat_num(my_table, j = c(4,7), digits = 1)
my_table
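Note that the colformat functions act on numeric columns, whereas the percent columns in our table were created with scales::percent() and are therefore already stored as formatted text. As an alternative sketch, you could keep a proportion column numeric and control its display in flextable with colformat_double(). The data frame toy_df and its pct_death column are hypothetical.

```r
library(flextable)

# Hypothetical data frame with the percent stored as a number, not text
toy_df <- data.frame(
  hospital  = c("A", "B"),
  pct_death = c(61.234, 53.871))

toy_ft <- flextable(toy_df)

# Display one decimal place and append a percent sign; the underlying data stay numeric
toy_ft <- colformat_double(toy_ft, j = "pct_death", digits = 1, suffix = "%")

toy_ft
```

Keeping the column numeric also makes later steps such as sorting and conditional formatting more predictable.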

Merge cells

Just as we merge cells horizontally in the header row, we can also merge cells vertically using merge_at() and specifying the rows (i) and column (j). Here we merge the “Hospital” and “Total cases with known outcome” values vertically to give them more space.

my_table <- my_table %>% 
  merge_at(i = 1:2, j = 1, part = "header") %>% 
  merge_at(i = 1:2, j = 2, part = "header")

my_table

Background color

To distinguish the content of the table from the headers, we may want to add additional formatting, e.g. changing the background color. In this example we change the table body to gray.

my_table <- my_table %>% 
    bg(part = "body", bg = "gray95")  

my_table 

29.3 Conditional formatting

We can highlight all values in a column that meet a certain rule, e.g. where more than 55% of cases died. Simply put the criteria to the i = or j = argument, preceded by a tilde ~. Reference the column in the data frame, not the display heading values.

my_table %>% 
  bg(j = 7, i = ~ Pct_Death >= 55, part = "body", bg = "red") 

Or, we can highlight the entire row meeting a certain criterion, such as a hospital of interest. To do this we just remove the column (j) specification so the criteria apply to all columns.

my_table %>% 
  bg(., i= ~ hospital == "Military Hospital", part = "body", bg = "#91c293") 
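One caution when writing numeric conditions: a column created with scales::percent(), such as Pct_Death, is stored as text, so a condition like Pct_Death >= 55 is evaluated as a text comparison. This happens to behave as intended for two-digit percentages but can mislead for other values. A sketch of a safer pattern, on a hypothetical data frame (toy_df) where the percent is kept numeric:

```r
library(flextable)

# Text comparison pitfall: "9.5%" is compared with "55" character by character
"9.5%" >= 55        # TRUE, even though 9.5 is less than 55

# Hypothetical data frame with the percent stored as a number
toy_df <- data.frame(
  hospital  = c("A", "B", "C"),
  pct_death = c(61.2, 53.9, 57.7))

toy_ft <- flextable(toy_df)

# Highlight cells in 'pct_death' where the numeric value is at least 55
toy_ft <- bg(toy_ft, i = ~ pct_death >= 55, j = "pct_death",
             bg = "red", part = "body")

toy_ft
```

The numeric column can still be shown with a percent sign by applying a display formatter afterwards.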

29.4 All code together

Below we show all the code from the above sections together.

border_style = officer::fp_border(color="black", width=1)

pacman::p_load(
  rio,            # import/export
  here,           # file pathways
  flextable,      # make HTML tables 
  officer,        # helper functions for tables
  tidyverse)      # data management, summary, and visualization

table <- linelist %>% 

  # Get summary values per hospital-outcome group
  ###############################################
  group_by(hospital, outcome) %>%                      # Group data
  summarise(                                           # Create new summary columns of indicators of interest
    N = n(),                                            # Number of rows per hospital-outcome group     
    ct_value = median(ct_blood, na.rm=T)) %>%           # median CT value per group
  
  # add totals
  ############
  bind_rows(                                           # Bind the previous table with this mini-table of totals
    linelist %>% 
      filter(!is.na(outcome) & hospital != "Missing") %>%
      group_by(outcome) %>%                            # Grouped only by outcome, not by hospital    
      summarise(
        N = n(),                                       # Number of rows for whole dataset     
        ct_value = median(ct_blood, na.rm=T))) %>%     # Median CT for whole dataset
  
  # Pivot wider and format
  ########################
  mutate(hospital = replace_na(hospital, "Total")) %>% 
  pivot_wider(                                         # Pivot from long to wide
    values_from = c(ct_value, N),                       # new values are from ct and count columns
    names_from = outcome) %>%                           # new column names are from outcomes
  mutate(                                              # Add new columns
    N_Known = N_Death + N_Recover,                               # number with known outcome
    Pct_Death = scales::percent(N_Death / N_Known, 0.1),         # percent cases who died (to 1 decimal)
    Pct_Recover = scales::percent(N_Recover / N_Known, 0.1)) %>% # percent who recovered (to 1 decimal)
  select(                                              # Re-order columns
    hospital, N_Known,                                   # Intro columns
    N_Recover, Pct_Recover, ct_value_Recover,            # Recovered columns
    N_Death, Pct_Death, ct_value_Death)  %>%             # Death columns
  arrange(N_Known) %>%                                 # Arrange rows from lowest to highest (Total row at bottom)

  # formatting
  ############
  flextable() %>%              # table is piped in from above
  add_header_row(
    top = TRUE,                # New header goes on top of existing header row
    values = c("Hospital",     # Header values for each column below
               "Total cases with known outcome", 
               "Recovered",    # This will be the top-level header for this and two next columns
               "",
               "",
               "Died",         # This will be the top-level header for this and two next columns
               "",             # Leave blank, as it will be merged with "Died"
               "")) %>% 
    set_header_labels(         # Rename the columns in original header row
      hospital = "", 
      N_Known = "",                  
      N_Recover = "Total",
      Pct_Recover = "% of cases",
      ct_value_Recover = "Median CT values",
      N_Death = "Total",
      Pct_Death = "% of cases",
      ct_value_Death = "Median CT values")  %>% 
  merge_at(i = 1, j = 3:5, part = "header") %>% # Horizontally merge columns 3 to 5 in new header row
  merge_at(i = 1, j = 6:8, part = "header") %>%  
  border_remove() %>%  
  theme_booktabs() %>% 
  vline(part = "all", j = 2, border = border_style) %>%   # at column 2 
  vline(part = "all", j = 5, border = border_style) %>%   # at column 5
  merge_at(i = 1:2, j = 1, part = "header") %>% 
  merge_at(i = 1:2, j = 2, part = "header") %>% 
  width(j=1, width = 2.7) %>% 
  width(j=2, width = 1.5) %>% 
  width(j=c(4,5,7,8), width = 1) %>% 
  flextable::align(., align = "center", j = c(2:8), part = "all") %>% 
  bg(., part = "body", bg = "gray95")  %>% 
  bg(., j=c(1:8), i= ~ hospital == "Military Hospital", part = "body", bg = "#91c293") %>% 
  colformat_num(., j = c(4,7), digits = 1) %>%
  bold(i = 1, bold = TRUE, part = "header") %>% 
  bold(i = 7, bold = TRUE, part = "body")
## `summarise()` has grouped output by 'hospital'. You can override using the `.groups` argument.
table

29.5 Saving your table

There are different ways the table can be integrated into your output.

Save single table

You can export the tables to Word, PowerPoint, or HTML, or as image (PNG) files. To do this, use one of the following functions:

  • save_as_docx()
  • save_as_pptx()
  • save_as_image()
  • save_as_html()

For instance, below we save our table as a Word document. Note the syntax of the first argument - you can just provide the name of your flextable object, e.g. my_table, or you can give it a “name” as shown below (the name is “my table”). If named, this will appear as the title of the table in Word. We also demonstrate code to save as a PNG image.

# Edit the 'my table' as needed for the title of table.  
save_as_docx("my table" = my_table, path = "file.docx")

save_as_image(my_table, path = "file.png")

Note the packages webshot or webshot2 are required to save a flextable as an image. Images may come out with transparent backgrounds.
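If you want to try the export functions without leaving files in your project folder, one option is to write to temporary files, as in this sketch. The table contents and the title "Cases by hospital" are hypothetical.

```r
library(flextable)

# Hypothetical small table
toy_ft <- flextable(data.frame(hospital = c("A", "B"), cases = c(10, 25)))

# Temporary file paths, removed automatically when the R session ends
docx_path <- tempfile(fileext = ".docx")
html_path <- tempfile(fileext = ".html")

save_as_docx("Cases by hospital" = toy_ft, path = docx_path)   # named table: the name becomes the title
save_as_html(toy_ft, path = html_path)

file.exists(c(docx_path, html_path))   # confirm that both files were written
```

For real outputs, replace the temporary paths with a path inside your project, for example one built with here().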

If you want to view a ‘live’ version of the flextable output in the intended document format, use print() and set the preview = argument to “docx” or “pptx”. The document will “pop-up” open on your computer in the specified software program, but will not be saved. This can be useful to check whether the table fits on one page/slide, or to quickly copy it into another document.

print(my_table, preview = "docx") # Word document example
print(my_table, preview = "pptx") # Powerpoint example

29.6 Resources

The full flextable book is here: https://ardata-fr.github.io/flextable-book/
The Github site is here
A manual of all the flextable functions can be found here

A gallery of beautiful example flextable tables with code can be accessed here

30 ggplot basics

ggplot2 is the most popular data visualisation R package. Its ggplot() function is at the core of this package, and this whole approach is colloquially known as “ggplot” with the resulting figures sometimes affectionately called “ggplots”. The “gg” in these names reflects the “grammar of graphics” used to construct the figures. ggplot2 benefits from a wide variety of supplementary R packages that further enhance its functionality.

The syntax is significantly different from base R plotting, and has a learning curve associated with it. Using ggplot2 generally requires the user to format their data in a way that is highly tidyverse compatible, which ultimately makes using these packages together very effective.

In this page we will cover the fundamentals of plotting with ggplot2. See the page ggplot tips for suggestions and advanced techniques to make your plots really look nice.

There are several extensive ggplot2 tutorials linked in the resources section. You can also download this data visualization with ggplot cheatsheet from the RStudio website. If you want inspiration for ways to creatively visualise your data, we suggest reviewing websites like the R graph gallery and Data-to-viz.

30.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  tidyverse,      # includes ggplot2 and other data management tools
  rio,            # import/export
  here,           # file locator
  stringr         # working with characters   
)

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

linelist <- rio::import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below. We will focus on the continuous variables age, wt_kg (weight in kilos), ct_blood (CT values), and days_onset_hosp (difference between onset date and hospitalisation).

General cleaning

When preparing data to plot, it is best to make the data adhere to “tidy” data standards as much as possible. How to achieve this is expanded on in the data management pages of this handbook, such as Cleaning data and core functions.

Some simple preparation steps make the contents of the data better for display, which is not necessarily the same as better for data manipulation. For example:

  • Replace NA values in a character column with the character string “Unknown”
  • Consider converting columns to class factor so that their values have a prescribed order (levels)
  • Clean some columns so that their “data friendly” values with underscores etc are changed to normal text or title case (see Characters and strings)

Here are some examples of this in action:

# make display version of columns with more friendly names
linelist <- linelist %>%
  mutate(
    gender_disp = case_when(gender == "m" ~ "Male",        # m to Male 
                            gender == "f" ~ "Female",      # f to Female
                            is.na(gender) ~ "Unknown"),    # NA to Unknown
    
    outcome_disp = replace_na(outcome, "Unknown")          # replace NA outcome with "Unknown"
  )
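
The list above also suggests converting columns to class factor, which the example does not show. Here is a minimal sketch of that step on a small made-up data frame (the object toy and its values are assumptions for illustration, not the handbook linelist). The order given to levels = is the order later used in legends and on axes.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for the linelist (values are made up)
toy <- tibble(outcome = c("Death", NA, "Recover", "Recover", NA))

toy <- toy %>%
  mutate(
    # first replace NA with a display value
    outcome_disp = replace_na(outcome, "Unknown"),
    # then convert to factor with an explicit level order
    outcome_disp = factor(outcome_disp, levels = c("Recover", "Death", "Unknown"))
  )

levels(toy$outcome_disp)   # the prescribed order of the levels
table(toy$outcome_disp)    # counts are reported in that same order
```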

Pivoting longer

As a matter of data structure, for ggplot2 we often also want to pivot our data into longer formats. Read more about this in the page on Pivoting data.

For example, say that we want to plot data that are in a “wide” format, such as the symptoms recorded for each case in the linelist. Below we create a mini-linelist called symptoms_data that contains only the case_id and symptoms columns.

symptoms_data <- linelist %>% 
  select(c(case_id, fever, chills, cough, aches, vomit))

Here is how the first 50 rows of this mini-linelist look - see how they are formatted “wide” with each symptom as a column:

If we wanted to plot the number of cases with specific symptoms, we are limited by the fact that each symptom is a specific column. However, we can pivot the symptoms columns to a longer format like this:

symptoms_data_long <- symptoms_data %>%    # begin with "mini" linelist called symptoms_data
  
  pivot_longer(
    cols = -case_id,                       # pivot all columns except case_id (all the symptoms columns)
    names_to = "symptom_name",             # assign name for new column that holds the symptoms
    values_to = "symptom_is_present") %>%  # assign name for new column that holds the values (yes/no)
  
  mutate(symptom_is_present = replace_na(symptom_is_present, "unknown")) # convert NA to "unknown"

Here are the first 50 rows. Note that each case has 5 rows - one for each possible symptom. The new columns symptom_name and symptom_is_present are the result of the pivot. Note that this format may not be very useful for other operations, but is useful for plotting.
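
To illustrate why the long format helps, here is a small sketch on made-up data (the object toy_symptoms and its values are assumptions, with only two symptom columns for brevity). Once the data are long, one count() call summarises every symptom at once.

```r
library(dplyr)
library(tidyr)

# Toy "wide" data: one row per case, one column per symptom (values are made up)
toy_symptoms <- tibble(
  case_id = c("a1", "a2", "a3"),
  fever   = c("yes", "no", NA),
  cough   = c("yes", "yes", "no")
)

# Pivot longer, as in the handbook example above
toy_long <- toy_symptoms %>%
  pivot_longer(
    cols = -case_id,
    names_to = "symptom_name",
    values_to = "symptom_is_present") %>%
  mutate(symptom_is_present = replace_na(symptom_is_present, "unknown"))

# One row per symptom-value combination, with the number of cases
symptom_counts <- toy_long %>%
  count(symptom_name, symptom_is_present)

symptom_counts
```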

30.2 Basics of ggplot

“Grammar of graphics” - ggplot2

Plotting with ggplot2 is based on “adding” plot layers and design elements on top of one another, with each command added to the previous ones with a plus symbol (+). The result is a multi-layer plot object that can be saved, modified, printed, exported, etc.

ggplot objects can be highly complex, but the basic order of layers will usually look like this:

  1. Begin with the baseline ggplot() command - this “opens” the ggplot and allows subsequent functions to be added with +. Typically the dataset is also specified in this command
  2. Add “geom” layers - these functions visualize the data as geometries (shapes), e.g. as a bar graph, line plot, scatter plot, histogram (or a combination!). These functions all start with geom_ as a prefix.
  3. Add design elements to the plot such as axis labels, title, fonts, sizes, color schemes, legends, or axes rotation

A simple example of skeleton code is as follows. We will explain each component in the sections below.

# plot data from my_data columns as red points
ggplot(data = my_data)+                   # use the dataset "my_data"
  geom_point(                             # add a layer of points (dots)
    mapping = aes(x = col1, y = col2),    # "map" data column to axes
    color = "red")+                       # other specification for the geom
  labs()+                                 # here you add titles, axes labels, etc.
  theme()                                 # here you adjust color, font, size etc of non-data plot elements (axes, title, etc.) 

30.3 ggplot()

The opening command of any ggplot2 plot is ggplot(). This command simply creates a blank canvas upon which to add layers. It “opens” the way for further layers to be added with a + symbol.

Typically, the command ggplot() includes the data = argument for the plot. This sets the default dataset to be used for subsequent layers of the plot.

This command will end with a + after its closing parentheses. This leaves the command “open”. The ggplot will only execute/appear when the full command includes a final layer without a + at the end.

# This will create a plot that is a blank canvas
ggplot(data = linelist)

30.4 Geoms

A blank canvas is certainly not sufficient - we need to create geometries (shapes) from our data (e.g. bar plots, histograms, scatter plots, box plots).

This is done by adding “geom” layers to the initial ggplot() command. There are many ggplot2 functions that create “geoms”. Each of these functions begins with “geom_”, so we will refer to them generically as geom_XXXX(). There are over 40 geoms in ggplot2 and many others created by fans. View them at the ggplot2 gallery. Some common geoms are listed below:

  • Histograms - geom_histogram()
  • Bar charts - geom_bar() or geom_col() (see “Bar plot” section)
  • Box plots - geom_boxplot()
  • Points (e.g. scatter plots) - geom_point()
  • Line graphs - geom_line() or geom_path()
  • Trend lines - geom_smooth()

In one plot you can display one or multiple geoms. Each is added to previous ggplot2 commands with a +, and they are plotted sequentially such that later geoms are plotted on top of previous ones.
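
As a minimal sketch of layering, the code below draws a line and then points on the same canvas. The data frame toy is made up for illustration (it is not the handbook linelist). Because geom_point() is added after geom_line(), the points are drawn on top of the line.

```r
library(ggplot2)

# Toy data (made up for illustration)
toy <- data.frame(
  age   = c(2, 10, 25, 40, 60, 75),
  wt_kg = c(12, 30, 62, 70, 68, 64)
)

layered_plot <- ggplot(data = toy, mapping = aes(x = age, y = wt_kg)) +
  geom_line(color = "grey50") +              # first layer: drawn underneath
  geom_point(color = "darkred", size = 3)    # second layer: drawn on top

layered_plot
```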

30.5 Mapping data to the plot

Most geom functions must be told what to use to create their shapes - so you must tell them how they should map (assign) columns in your data to components of the plot like the axes, shape colors, shape sizes, etc. For most geoms, the essential components that must be mapped to columns in the data are the x-axis, and (if necessary) the y-axis.

This “mapping” occurs with the mapping = argument. The mappings you provide to mapping must be wrapped in the aes() function, so you would write something like mapping = aes(x = col1, y = col2), as shown below.

Below, in the ggplot() command the data are set as the case linelist. In the mapping = aes() argument the column age is mapped to the x-axis, and the column wt_kg is mapped to the y-axis.

After a +, the plotting commands continue. A shape is created with the “geom” function geom_point(). This geom inherits the mappings from the ggplot() command above - it knows the axis-column assignments and proceeds to visualize those relationships as points on the canvas.

ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+
  geom_point()

As another example, the following commands utilize the same data, a slightly different mapping, and a different geom. The geom_histogram() function only requires a column mapped to the x-axis, as the counts y-axis is generated automatically.

ggplot(data = linelist, mapping = aes(x = age))+
  geom_histogram()

Plot aesthetics

In ggplot terminology a plot “aesthetic” has a specific meaning. It refers to a visual property of plotted data. Note that “aesthetic” here refers to the data being plotted in geoms/shapes - not the surrounding display such as titles, axis labels, background color, that you might associate with the word “aesthetics” in common English. In ggplot those details are called “themes” and are adjusted within a theme() command (see the Themes section).

Therefore, plot object aesthetics can be colors, sizes, transparencies, placement, etc. of the plotted data. Not all geoms will have the same aesthetic options, but many can be used by most geoms. Here are some examples:

  • shape = Display a point with geom_point() as a dot, star, triangle, or square…
  • fill = The interior color (e.g. of a bar or boxplot)
  • color = The exterior line of a bar, boxplot, etc., or the point color if using geom_point()
  • size = Size (e.g. line thickness, point size)
  • alpha = Transparency (1 = opaque, 0 = invisible)
  • binwidth = Width of histogram bins
  • width = Width of “bar plot” columns
  • linetype = Line type (e.g. solid, dashed, dotted)

These plot object aesthetics can be assigned values in two ways:

  1. Assigned a static value (e.g. color = "blue") to apply across all plotted observations
  2. Assigned to a column of the data (e.g. color = hospital) such that display of each observation depends on its value in that column

Set to a static value

If you want the plot object aesthetic to be static, that is - to be the same for every observation in the data, you write its assignment within the geom but outside of any mapping = aes() statement. These assignments could look like size = 1 or color = "blue". Here are two examples:

  • In the first example, the mapping = aes() is in the ggplot() command and the axes are mapped to age and weight columns in the data. The plot aesthetics color =, size =, and alpha = (transparency) are assigned to static values. For clarity, this is done in the geom_point() function, as you may add other geoms afterward that would take different values for their plot aesthetics.
  • In the second example, the histogram requires only the x-axis mapped to a column. The histogram binwidth =, color =, fill = (internal color), and alpha = are again set within the geom to static values.

# scatterplot
ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+  # set data and axes mapping
  geom_point(color = "darkgreen", size = 0.5, alpha = 0.2)         # set static point aesthetics

# histogram
ggplot(data = linelist, mapping = aes(x = age))+       # set data and axes
  geom_histogram(              # display histogram
    binwidth = 7,                # width of bins
    color = "red",               # bin line color
    fill = "blue",               # bin interior color
    alpha = 0.1)                 # bin transparency

Scaled to column values

The alternative is to scale the plot object aesthetic by the values in a column. In this approach, the display of this aesthetic will depend on that observation’s value in that column of the data. If the column values are continuous, the display scale (legend) for that aesthetic will be continuous. If the column values are discrete, the legend will display each value and the plotted data will appear as distinctly “grouped” (read more in the grouping section of this page).

To achieve this, you map that plot aesthetic to a column name (not in quotes). This must be done within a mapping = aes() function (note: there are several places in the code you can make these mapping assignments, as discussed below).

Two examples are below.

  • In the first example, the color = aesthetic (of each point) is mapped to the column age - and a scale has appeared in a legend! For now just note that the scale exists - we will show how to modify it in later sections.
  • In the second example two new plot aesthetics are also mapped to columns (color = and size =), while the plot aesthetics shape = and alpha = are mapped to static values outside of any mapping = aes() function.

# scatterplot
ggplot(data = linelist,   # set data
       mapping = aes(     # map aesthetics to column values
         x = age,           # map x-axis to age            
         y = wt_kg,         # map y-axis to weight
         color = age))+     # map color to age
  geom_point()            # display data as points

# scatterplot
ggplot(data = linelist,   # set data
       mapping = aes(     # map aesthetics to column values
         x = age,           # map x-axis to age            
         y = wt_kg,         # map y-axis to weight
         color = age,       # map color to age
         size = age))+      # map size to age
  geom_point(             # display data as points
    shape = "diamond",      # points display as diamonds
    alpha = 0.3)            # point transparency at 30%

Note: Axes assignments are always assigned to columns in the data (not to static values), and this is always done within mapping = aes().

It becomes important to keep track of your plot layers and aesthetics when making more complex plots - for example plots with multiple geoms. In the example below, the size = aesthetic is assigned twice - once for geom_point() and once for geom_smooth() - both times as a static value.

ggplot(data = linelist,
       mapping = aes(           # map aesthetics to columns
         x = age,
         y = wt_kg,
         color = age_years)
       ) + 
  geom_point(                   # add points for each row of data
    size = 1,
    alpha = 0.5) +  
  geom_smooth(                  # add a trend line 
    method = "lm",              # with linear method
    size = 2)                   # size (width of line) of 2

Where to make mapping assignments

Aesthetic mapping within mapping = aes() can be written in several places in your plotting commands and can even be written more than once. This can be written in the top ggplot() command, and/or for each individual geom beneath. The nuances include:

  • Mapping assignments made in the top ggplot() command will be inherited as defaults across any geom below, like how x = and y = are inherited
  • Mapping assignments made within one geom apply only to that geom

Likewise, data = specified in the top ggplot() will apply by default to any geom below, but you could also specify data for each geom (but this is more difficult).

Thus, each of the following commands will create the same plot:

# These commands will produce the exact same plot
ggplot(data = linelist, mapping = aes(x = age))+
  geom_histogram()

ggplot(data = linelist)+
  geom_histogram(mapping = aes(x = age))

ggplot()+
  geom_histogram(data = linelist, mapping = aes(x = age))
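
As a sketch of giving one geom its own data, the code below plots all rows in grey and then highlights a subset with a second geom_point() that receives a different data = argument. The data frame toy, the hospital labels, and the object highlight are made up for illustration. The x = and y = mappings are still inherited from the top ggplot() command.

```r
library(ggplot2)

# Toy data (made up for illustration)
toy <- data.frame(
  age      = c(2, 10, 25, 40, 60, 75),
  wt_kg    = c(12, 30, 62, 70, 68, 64),
  hospital = c("A", "A", "B", "B", "A", "B")
)

# Subset of rows to highlight
highlight <- toy[toy$hospital == "B", ]

two_data_plot <- ggplot(data = toy, mapping = aes(x = age, y = wt_kg)) +
  geom_point(color = "grey60", size = 2) +          # uses the default data (toy)
  geom_point(data = highlight,                      # uses its own data (highlight)
             color = "red", size = 4, shape = 1)    # open circles around the subset

two_data_plot
```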

Groups

You can easily group the data and “plot by group”. In fact, you have already done this!

Assign the “grouping” column to the appropriate plot aesthetic, within a mapping = aes(). Above, we demonstrated this using continuous values when we assigned point size = to the column age. However this works the same way for discrete/categorical columns.

For example, if you want points to be displayed by gender, you would set mapping = aes(color = gender). A legend automatically appears. This assignment can be made within the mapping = aes() in the top ggplot() command (and be inherited by the geom), or it could be set in a separate mapping = aes() within the geom. Both approaches are shown below:

ggplot(data = linelist,
       mapping = aes(x = age, y = wt_kg, color = gender))+
  geom_point(alpha = 0.5)

# This alternative code produces the same plot
ggplot(data = linelist,
       mapping = aes(x = age, y = wt_kg))+
  geom_point(
    mapping = aes(color = gender),
    alpha = 0.5)

Note that depending on the geom, you will need to use different arguments to group the data. For geom_point() you will most likely use color =, shape = or size =. Whereas for geom_bar() you are more likely to use fill =. This just depends on the geom and what plot aesthetic you want to reflect the groupings.

For your information - the most basic way of grouping the data is by using only the group = argument within mapping = aes(). However, this by itself will not change the colors, fill, or shapes. Nor will it create a legend. Yet the data are grouped, so statistical displays may be affected.
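
Here is a minimal sketch of group = used on its own, with made-up repeated temperature measurements (the data frame toy and its columns are assumptions for illustration). Without group = case_id, geom_line() would connect all nine points as one line. With it, one line is drawn per case, all in the same color and with no legend.

```r
library(ggplot2)

# Toy data: three cases, each measured on three days (values are made up)
toy <- data.frame(
  case_id = rep(c("a", "b", "c"), each = 3),
  day     = rep(1:3, times = 3),
  temp    = c(37.0, 38.2, 37.5, 36.8, 37.1, 39.0, 38.5, 38.0, 37.2)
)

grouped_lines <- ggplot(data = toy,
                        mapping = aes(x = day, y = temp, group = case_id)) +
  geom_line()    # one line per case_id

grouped_lines
```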

To adjust the order of groups in a plot, see the ggplot tips page or the page on Factors. There are many examples of grouped plots in the sections below on plotting continuous and categorical data.

30.6 Facets / Small-multiples

Facets, or “small-multiples”, are used to split one plot into a multi-panel figure, with one panel (“facet”) per group of data. The same type of plot is created multiple times, each one using a sub-group of the same dataset.

Faceting is a functionality that comes with ggplot2, so the legends and axes of the facet “panels” are automatically aligned. There are other packages discussed in the ggplot tips page that are used to combine completely different plots (cowplot and patchwork) into one figure.

Faceting is done with one of the following ggplot2 functions:

  1. facet_wrap() To show a different panel for each level of a single variable. One example of this could be showing a different epidemic curve for each hospital in a region. Facets are ordered alphabetically, unless the variable is a factor with other ordering defined.
    • You can invoke certain options to determine the layout of the facets, e.g. nrow = 1 or ncol = 1 to control the number of rows or columns that the faceted plots are arranged within.
  2. facet_grid() This is used when you want to bring a second variable into the faceting arrangement. Here each panel of a grid shows the intersection between values in two columns. For example, epidemic curves for each hospital-age group combination with hospitals along the top (columns) and age groups along the sides (rows).
    • nrow and ncol are not relevant, as the subgroups are presented in a grid

Each of these functions accepts a formula syntax to specify the column(s) for faceting. Both accept up to two columns, one on each side of a tilde ~.

  • For facet_wrap() most often you will write only one column preceded by a tilde ~ like facet_wrap(~hospital). However you can write two columns facet_wrap(outcome ~ hospital) - each unique combination will display in a separate panel, but they will not be arranged in a grid. The headings will show the combined terms, and there is no specific logic to which panels fall in which rows or columns. If you are providing only one faceting variable, nothing needs to be written on the left side of the tilde.

  • For facet_grid() you can also specify one or two columns to the formula (grid rows ~ columns). If you only want to specify one, you can place a period . on the other side of the tilde like facet_grid(. ~ hospital) or facet_grid(hospital ~ .).
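
The two formula styles above can be sketched as follows, using made-up weekly counts for three districts (the data frame toy and its values are assumptions for illustration). The first plot stacks the panels in a single column with ncol = 1. The second uses the period placeholder so that districts run across the top as grid columns.

```r
library(ggplot2)

# Toy data: weekly case counts in three districts (values are made up)
toy <- data.frame(
  District = rep(c("North", "South", "West"), each = 4),
  week     = rep(1:4, times = 3),
  cases    = c(3, 5, 8, 4, 10, 12, 9, 7, 1, 0, 2, 1)
)

# facet_wrap(): one variable, panels stacked in one column
stacked <- ggplot(toy, aes(x = week, y = cases)) +
  geom_col() +
  facet_wrap(~District, ncol = 1)

# facet_grid(): period placeholder for rows, districts as columns
side_by_side <- ggplot(toy, aes(x = week, y = cases)) +
  geom_col() +
  facet_grid(. ~ District)

stacked
side_by_side
```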

Facets can quickly contain an overwhelming amount of information - it’s good to ensure you don’t have too many levels of each variable that you choose to facet by. Here are some quick examples with the malaria dataset (see Download handbook and data) which consists of daily case counts of malaria for facilities, by age group.

Below we import and do some quick modifications for simplicity:

# These data are daily counts of malaria cases, by facility-day
malaria_data <- import(here("data", "malaria_facility_count_data.rds")) %>%  # import
  select(-submitted_date, -Province, -newid)                                 # remove unneeded columns

The first 50 rows of the malaria data are below. Note there is a column malaria_tot, but also columns for counts by age group (these will be used in the second, facet_grid() example).

facet_wrap()

For the moment, let’s focus on the columns malaria_tot and District. Ignore the age-specific count columns for now. We will plot epidemic curves with geom_col(), which produces a column for each day at the specified y-axis height given in column malaria_tot (the data are already daily counts, so we use geom_col() - see the “Bar plot” section below).

When we add the command facet_wrap(), we specify a tilde and then the column to facet on (District in this case). You can place another column on the left side of the tilde - this will create one facet for each combination - but we recommend you do this with facet_grid() instead. In this use case, one facet is created for each unique value of District.

# A plot with facets by district
ggplot(malaria_data, aes(x = data_date, y = malaria_tot)) +
  geom_col(width = 1, fill = "darkred") +       # plot the count data as columns
  theme_minimal()+                              # simplify the background panels
  labs(                                         # add plot labels, title, etc.
    x = "Date of report",
    y = "Malaria cases",
    title = "Malaria cases by district") +
  facet_wrap(~District)                       # the facets are created

facet_grid()

We can use a facet_grid() approach to cross two variables. Let’s say we want to cross District and age. Well, we need to do some data transformations on the age columns to get these data into ggplot-preferred “long” format. The age groups all have their own columns - we want them in a single column called age_group and another called num_cases. See the page on Pivoting data for more information on this process.

malaria_age <- malaria_data %>%
  select(-malaria_tot) %>% 
  pivot_longer(
    cols = c(starts_with("malaria_rdt_")),  # choose columns to pivot longer
    names_to = "age_group",      # column names become age group
    values_to = "num_cases"      # values to a single column (num_cases)
  ) %>%
  mutate(
    age_group = str_replace(age_group, "malaria_rdt_", ""),
    age_group = forcats::fct_relevel(age_group, "5-14", after = 1))

Now the first 50 rows of data look like this:

When you pass the two variables to facet_grid(), the easiest approach is to use formula notation (e.g. x ~ y) where x is rows and y is columns. Here is the plot, using facet_grid() to show the plots for each combination of the columns age_group and District.

ggplot(malaria_age, aes(x = data_date, y = num_cases)) +
  geom_col(fill = "darkred", width = 1) +
  theme_minimal()+
  labs(
    x = "Date of report",
    y = "Malaria cases",
    title = "Malaria cases by district and age group"
  ) +
  facet_grid(District ~ age_group)

Free or fixed axes

The axes scales displayed when faceting are by default the same (fixed) across all the facets. This is helpful for cross-comparison, but not always appropriate.

When using facet_wrap() or facet_grid(), we can add scales = "free_y" to “free” or release the y-axes of the panels to scale appropriately to their data subset. This is particularly useful if the actual counts are small for one of the subcategories and trends are otherwise hard to see. Instead of “free_y” we can also write “free_x” to do the same for the x-axis (e.g. for dates) or “free” for both axes. Note that in facet_grid, the y scales will be the same for facets in the same row, and the x scales will be the same for facets in the same column.

When using facet_grid only, we can add space = "free_y" or space = "free_x" so that the actual height or width of the facet is weighted to the values of the figure within. This only works if scales = "free" (y or x) is already applied.

# Free x- and y-axes
ggplot(malaria_data, aes(x = data_date, y = malaria_tot)) +
  geom_col(width = 1, fill = "darkred") +       # plot the count data as columns
  theme_minimal()+                              # simplify the background panels
  labs(                                         # add plot labels, title, etc.
    x = "Date of report",
    y = "Malaria cases",
    title = "Malaria cases by district - 'free' x and y axes") +
  facet_wrap(~District, scales = "free")        # the facets are created
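
The space = argument of facet_grid() is described above but not demonstrated, so here is a minimal sketch. It assumes a made-up data frame toy in which one district has three facilities and the other has one. With scales = "free_y" each panel shows only its own facilities, and with space = "free_y" the panel heights are proportional to the number of facilities, so all bars have the same thickness.

```r
library(ggplot2)

# Toy data: districts with different numbers of facilities (values are made up)
toy <- data.frame(
  District = c("North", "North", "North", "South"),
  facility = c("Clinic A", "Clinic B", "Clinic C", "Clinic D"),
  cases    = c(12, 7, 3, 9)
)

free_space_plot <- ggplot(toy, aes(x = cases, y = facility)) +
  geom_col() +                       # horizontal bars, one per facility
  facet_grid(District ~ .,           # one row of the grid per district
             scales = "free_y",      # each panel shows only its own facilities
             space = "free_y")       # panel height follows the number of facilities

free_space_plot
```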

Factor level order in facets

See this post on how to re-order factor levels within facets.

30.7 Storing plots

Saving plots

By default when you run a ggplot() command, the plot will be printed to the Plots RStudio pane. However, you can also save the plot as an object by using the assignment operator <- and giving it a name. Then it will not print unless the object name itself is run. You can also print it by wrapping the plot name with print(), but this is only necessary in certain circumstances such as if the plot is created inside a for loop used to print multiple plots at once (see Iteration, loops, and lists page).

# define plot
age_by_wt <- ggplot(data = linelist, mapping = aes(x = age_years, y = wt_kg, color = age_years))+
  geom_point(alpha = 0.1)

# print
age_by_wt    
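
As a sketch of the for loop case mentioned above, the code below makes one plot per hospital from a made-up data frame (toy, plot_list, and the hospital labels are assumptions for illustration). Inside a loop a plot object is not printed automatically, so print() is called explicitly. Each plot is also stored in a list for later use.

```r
library(ggplot2)

# Toy data (made up for illustration)
toy <- data.frame(
  hospital = rep(c("A", "B"), each = 3),
  age      = c(5, 30, 62, 12, 44, 70),
  wt_kg    = c(18, 65, 71, 35, 80, 66)
)

plot_list <- list()    # empty list to hold the plots

for (h in unique(toy$hospital)) {
  p <- ggplot(data = toy[toy$hospital == h, ],
              mapping = aes(x = age, y = wt_kg)) +
    geom_point() +
    labs(title = paste("Hospital", h))

  plot_list[[h]] <- p   # store the plot under the hospital name
  print(p)              # explicit print() is needed inside the loop
}

names(plot_list)
```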

Modifying saved plots

One nice thing about ggplot2 is that you can define a plot (as above), and then add layers to it starting with its name. You do not have to repeat all the commands that created the original plot!

For example, to modify the plot age_by_wt that was defined above, to include a vertical line at age 50, we would just add a + and begin adding additional layers to the plot.

age_by_wt+
  geom_vline(xintercept = 50)

Exporting plots

Exporting ggplots is made easy with the ggsave() function from ggplot2. It can work in two ways, either:

  • Specify the file path and name with extension, then the name of the plot object to the plot = argument
    • For example: ggsave(here("plots", "my_plot.png"), plot = my_plot)
  • Run the command with only a file path, to save the last plot that was printed
    • For example: ggsave(here("plots", "my_plot.png"))

You can export as png, pdf, jpeg, tiff, bmp, svg, or several other file types, by specifying the file extension in the file path.

You can also specify the arguments width =, height =, and units = (either “in”, “cm”, or “mm”). You can also specify dpi = with a number for plot resolution (e.g. 300). See the function details by entering ?ggsave or reading the documentation online.

Remember that you can use here() syntax to provide the desired file path. See the Import and export page for more information.
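
Here is a minimal sketch that puts these arguments together. The plot is built from made-up data, and the file is written to R's temporary directory (the names my_plot and out_file are assumptions for illustration). Note that in ggsave() the file name comes first, and the plot object is supplied to plot =.

```r
library(ggplot2)

# Toy data and plot (made up for illustration)
toy <- data.frame(
  age   = c(2, 10, 25, 40, 60, 75),
  wt_kg = c(12, 30, 62, 70, 68, 64)
)

my_plot <- ggplot(data = toy, mapping = aes(x = age, y = wt_kg)) +
  geom_point()

# Hypothetical output location in the temporary directory
out_file <- file.path(tempdir(), "my_plot.png")

ggsave(
  filename = out_file,   # file path, with the extension setting the file type
  plot = my_plot,        # the plot object to save
  width = 15,            # size, in the units given below
  height = 10,
  units = "cm",
  dpi = 300)             # resolution

file.exists(out_file)
```

In a project you would typically replace the temporary path with one built by here(), such as here("plots", "my_plot.png").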

30.8 Labels

Surely you will want to add or adjust the plot’s labels. These are most easily done within the labs() function which is added to the plot with + just as the geoms were.

Within labs() you can provide character strings to these arguments:

  • x = and y = The x-axis and y-axis title (labels)
  • title = The main plot title
  • subtitle = The subtitle of the plot, in smaller text below the title
  • caption = The caption of the plot, in bottom-right by default

Here is a plot we made earlier, but with nicer labels:

age_by_wt <- ggplot(
  data = linelist,   # set data
  mapping = aes(     # map aesthetics to column values
         x = age,           # map x-axis to age            
         y = wt_kg,         # map y-axis to weight
         color = age))+     # map color to age
  geom_point()+           # display data as points
  labs(
    title = "Age and weight distribution",
    subtitle = "Fictional Ebola outbreak, 2014",
    x = "Age in years",
    y = "Weight in kilos",
    color = "Age",
    caption = stringr::str_glue("Data as of {max(linelist$date_hospitalisation, na.rm = TRUE)}"))

age_by_wt

Note how in the caption assignment we used str_glue() from the stringr package to implant dynamic R code within the string text. The caption will show the “Data as of” date that reflects the maximum hospitalization date in the linelist. Read more about this in the page on Characters and strings.

A note on specifying the legend title: There is no one “legend title” argument, as you could have multiple scales in your legend. Within labs(), you can write the argument for the plot aesthetic used to create the legend, and provide the title this way. For example, above we assigned color = age to create the legend. Therefore, we provide color = to labs() and assign the legend title desired (“Age” with capital A). If you create the legend with aes(fill = COLUMN), then in labs() you would write fill = to adjust the title of that legend. The section on color scales in the ggplot tips page provides more details on editing legends, and an alternative approach using scale_*() functions.
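
As a sketch of the fill = case, the bar plot below maps fill to a column and then sets the legend title through labs(fill = ). The data frame toy and its values are made up for illustration.

```r
library(ggplot2)

# Toy data (made up for illustration)
toy <- data.frame(
  gender  = c("f", "m", "f", "m", "f"),
  outcome = c("Death", "Recover", "Recover", "Recover", "Death")
)

outcome_bars <- ggplot(data = toy, mapping = aes(x = gender, fill = outcome)) +
  geom_bar() +                   # counts rows per gender, stacked by outcome
  labs(
    x = "Gender",
    y = "Number of cases",
    fill = "Outcome")            # legend title, matching the fill = aesthetic

outcome_bars
```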

30.9 Themes

One of the best parts of ggplot2 is the amount of control you have over the plot - you can define anything! As mentioned above, the design of the plot that is not related to the data shapes/geometries is adjusted within the theme() function. For example, the plot background color, presence/absence of gridlines, and the font/size/color/alignment of text (titles, subtitles, captions, axis text…). These adjustments can be done in one of two ways:

  • Add a complete theme theme_() function to make sweeping adjustments - these include theme_classic(), theme_minimal(), theme_dark(), theme_light(), theme_grey(), theme_bw() among others
  • Adjust each tiny aspect of the plot individually within theme()

Complete themes

As they are quite straight-forward, we will demonstrate the complete theme functions below and will not describe them further here. Note that any micro-adjustments with theme() should be made after use of a complete theme.

Write them with empty parentheses.

ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+  
  geom_point(color = "darkgreen", size = 0.5, alpha = 0.2)+
  labs(title = "Theme classic")+
  theme_classic()

ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+  
  geom_point(color = "darkgreen", size = 0.5, alpha = 0.2)+
  labs(title = "Theme bw")+
  theme_bw()

ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+  
  geom_point(color = "darkgreen", size = 0.5, alpha = 0.2)+
  labs(title = "Theme minimal")+
  theme_minimal()

ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+  
  geom_point(color = "darkgreen", size = 0.5, alpha = 0.2)+
  labs(title = "Theme gray")+
  theme_gray()

Modify theme

The theme() function can take a large number of arguments, each of which edits a very specific aspect of the plot. There is no way we could cover all of the arguments, but we will describe the general pattern for them and show you how to find the argument name that you need. The basic syntax is this:

  1. Within theme() write the argument name for the plot element you want to edit, like plot.title =
  2. Provide an element_() function to the argument
    • Most often, use element_text(), but others include element_rect() for canvas background colors, or element_blank() to remove plot elements
  3. Within the element_() function, write argument assignments to make the fine adjustments you desire

That description was quite abstract, so here are some examples.

The below plot looks quite silly, but it serves to show you a variety of the ways you can adjust your plot.

  • We begin with the plot age_by_wt defined just above and add theme_classic()
  • For finer adjustments we add theme() and include one argument for each plot element to adjust

It can be nice to organize the arguments in logical sections. To describe just some of those used below:

  • legend.position = is unique in that it accepts simple values like “bottom”, “top”, “left”, and “right”. But generally, text-related arguments require that you place the details within element_text().
  • Title size with element_text(size = 30)
  • The caption is left-aligned with element_text(hjust = 0) (moved from the right, its default position, to the left)
  • The subtitle is italicized with element_text(face = "italic")

age_by_wt + 
  theme_classic()+                                 # pre-defined theme adjustments
  theme(
    legend.position = "bottom",                    # move legend to bottom
    
    plot.title = element_text(size = 30),          # size of title to 30
    plot.caption = element_text(hjust = 0),        # left-align caption
    plot.subtitle = element_text(face = "italic"), # italicize subtitle
    
    axis.text.x = element_text(color = "red", size = 15, angle = 90), # adjusts only x-axis text
    axis.text.y = element_text(size = 15),         # adjusts only y-axis text
    
    axis.title = element_text(size = 20)           # adjusts both axes titles
    )     

Here are some especially common theme() arguments. You will recognize some patterns, such as appending .x or .y to apply the change only to one axis.

theme() argument                     What it adjusts
plot.title = element_text()          The title
plot.subtitle = element_text()       The subtitle
plot.caption = element_text()        The caption (family, face, color, size, angle, vjust, hjust…)
axis.title = element_text()          Axis titles (both x and y) (size, face, angle, color…)
axis.title.x = element_text()        Axis title x-axis only (use .y for y-axis only)
axis.text = element_text()           Axis text (both x and y)
axis.text.x = element_text()         Axis text x-axis only (use .y for y-axis only)
axis.ticks = element_blank()         Remove axis ticks
axis.line = element_line()           Axis lines (colour, size, linetype: solid, dashed, dotted, etc.)
strip.text = element_text()          Facet strip text (colour, face, size, angle…)
strip.background = element_rect()    Facet strip background (fill, colour, size…)

But there are so many theme arguments! How could I remember them all? Do not worry - it is impossible to remember them all. Luckily there are a few tools to help you:

See the tidyverse documentation on modifying themes, which has a complete list.

TIP: Run theme_get() from ggplot2 to print a list of all 90+ theme() arguments to the console.
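
Because theme_get() returns the current theme as a named list, one way to search it (a sketch, not the only approach) is to take its names and filter them with a pattern:

```r
library(ggplot2)

# Names of all elements in the currently active theme
theme_args <- names(theme_get())

length(theme_args)                                 # how many elements are defined

# Find every element whose name starts with "axis.title"
grep("^axis\\.title", theme_args, value = TRUE)
```

The exact number of elements depends on your ggplot2 version.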

TIP: If you ever want to remove an element of a plot, you can also do it through theme(). Just pass element_blank() to an argument to have it disappear completely. For legends, set legend.position = "none".
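
For instance, the following sketch removes the axis ticks, the panel gridlines, and the legend from a plot. The data frame is made up for illustration.

```r
library(ggplot2)

# Small made-up data frame (not the handbook linelist)
demo <- data.frame(
  age   = c(5, 20, 35, 50),
  wt_kg = c(18, 60, 72, 68),
  sex   = c("f", "m", "f", "m")
)

clean_plot <- ggplot(data = demo, mapping = aes(x = age, y = wt_kg, color = sex)) +
  geom_point() +
  theme_minimal() +
  theme(
    axis.ticks = element_blank(),     # remove tick marks on both axes
    panel.grid = element_blank(),     # remove all gridlines
    legend.position = "none")         # remove the legend

clean_plot
```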

30.10 Colors

Please see this section on color scales of the ggplot tips page.

30.11 Piping into ggplot2

When using pipes to clean and transform your data, it is easy to pass the transformed data into ggplot().

The pipes that pass the dataset from function-to-function will transition to + once the ggplot() function is called. Note that in this case, there is no need to specify the data = argument, as this is automatically defined as the piped-in dataset.

This is how that might look:

linelist %>%                                                     # begin with linelist
  select(c(case_id, fever, chills, cough, aches, vomit)) %>%     # select columns
  pivot_longer(                                                  # pivot longer
    cols = -case_id,                                  
    names_to = "symptom_name",
    values_to = "symptom_is_present") %>%
  mutate(                                                        # replace missing values
    symptom_is_present = replace_na(symptom_is_present, "unknown")) %>% 
  
  ggplot(                                                        # begin ggplot!
    mapping = aes(x = symptom_name, fill = symptom_is_present))+
  geom_bar(position = "fill", col = "black") +                    
  theme_classic() +
  labs(
    x = "Symptom",
    y = "Symptom status (proportion)"
  )

30.12 Plot continuous data

Throughout this page, you have already seen many examples of plotting continuous data. Here we briefly consolidate these and present a few variations.
Visualisations covered here include:

  • Plots for one continuous variable:
    • Histogram, a classic graph to present the distribution of a continuous variable.
    • Box plot (also called box and whisker), to show the 25th, 50th, and 75th percentiles, tail ends of the distribution, and outliers (but note its important limitations, described below).
    • Jitter plot, to show all values as points that are ‘jittered’ so they can (mostly) all be seen, even where two have the same value.
    • Violin plot, to show the distribution of a continuous variable based on the symmetrical width of the ‘violin’.
    • Sina plot, a combination of jitter and violin plots, where individual points are shown but in the symmetrical shape of the distribution (via the ggforce package).
  • Scatter plot for two continuous variables.
  • Heat plots for three continuous variables (linked to Heat plots page)

Histograms

Histograms may look like bar charts, but are distinct because they measure the distribution of a continuous variable. There are no spaces between the “bars”, and only one column is provided to geom_histogram().

Below is code for generating histograms, which group continuous data into ranges and display in adjacent bars of varying height. This is done using geom_histogram(). See the “Bar plot” section of the ggplot basics page to understand difference between geom_histogram(), geom_bar(), and geom_col().

We will show the distribution of ages of cases. Within mapping = aes() specify which column you want to see the distribution of. You can assign this column to either the x or the y axis.

The rows will be assigned to “bins” based on their numeric age, and these bins will be graphically represented by bars. If you specify a number of bins with the bins = argument, the break points are evenly spaced between the minimum and maximum values of the histogram. If bins = is unspecified, ggplot2 defaults to 30 bins and displays this message after the plot:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If you do not want to specify a number of bins to bins =, you could alternatively specify binwidth = in the units of the axis. We give a few examples showing different bins and bin widths:

# A) Regular histogram
ggplot(data = linelist, aes(x = age))+  # provide x variable
  geom_histogram()+
  labs(title = "A) Default histogram (30 bins)")

# B) More bins
ggplot(data = linelist, aes(x = age))+  # provide x variable
  geom_histogram(bins = 50)+
  labs(title = "B) Set to 50 bins")

# C) Fewer bins
ggplot(data = linelist, aes(x = age))+  # provide x variable
  geom_histogram(bins = 5)+
  labs(title = "C) Set to 5 bins")


# D) Set bin width instead of number of bins
ggplot(data = linelist, aes(x = age))+  # provide x variable
  geom_histogram(binwidth = 1)+
  labs(title = "D) binwidth of 1")

To get smoothed proportions, you can use geom_density():

# Frequency with proportion axis, smoothed
ggplot(data = linelist, mapping = aes(x = age)) +
  geom_density(size = 2, alpha = 0.2)+
  labs(title = "Proportional density")

# Stacked frequency with proportion axis, smoothed
ggplot(data = linelist, mapping = aes(x = age, fill = gender)) +
  geom_density(size = 2, alpha = 0.2, position = "stack")+
  labs(title = "'Stacked' proportional densities")

To get a “stacked” histogram (of a continuous column of data), you can do one of the following:

  1. Use geom_histogram() with the fill = argument within aes() and assigned to the grouping column, or
  2. Use geom_freqpoly(), which is likely easier to read (you can still set binwidth =)
  3. To see proportions of all values, set y = after_stat(density) within aes() (write this exactly as shown - it does not change for your data). Note: these proportions are calculated per group.

Each is shown below (note the use of color = vs. fill = in each):

# "Stacked" histogram
ggplot(data = linelist, mapping = aes(x = age, fill = gender)) +
  geom_histogram(binwidth = 2)+
  labs(title = "'Stacked' histogram")

# Frequency 
ggplot(data = linelist, mapping = aes(x = age, color = gender)) +
  geom_freqpoly(binwidth = 2, size = 2)+
  labs(title = "Freqpoly")

# Frequency with proportion axis
ggplot(data = linelist, mapping = aes(x = age, y = after_stat(density), color = gender)) +
  geom_freqpoly(binwidth = 5, size = 2)+
  labs(title = "Proportional freqpoly")

# Frequency with proportion axis, smoothed
ggplot(data = linelist, mapping = aes(x = age, y = after_stat(density), fill = gender)) +
  geom_density(size = 2, alpha = 0.2)+
  labs(title = "Proportional, smoothed with geom_density()")

If you want to have some fun, try geom_density_ridges() from the ggridges package (see the package vignette).

Read more in detail about histograms at the tidyverse page on geom_histogram().

Box plots

Box plots are common, but have important limitations. They can obscure the actual distribution - e.g. a bi-modal distribution. See this R graph gallery and this data-to-viz article for more details. However, they do nicely display the inter-quartile range and outliers - so they can be overlaid on top of other types of plots that show the distribution in more detail.

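As a reminder of the components of a box plot: geom_boxplot() draws the box from the 25th to the 75th percentile (the inter-quartile range, or IQR), a line at the median, whiskers out to the most extreme values within 1.5 × IQR of the box, and individual points for any values beyond the whiskers (outliers).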
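
To make these components concrete, the base R sketch below computes them by hand for a small set of hypothetical ages. It follows the default rule documented for geom_boxplot() (quartiles for the box, whiskers limited to 1.5 × IQR from the box); the object names are illustrative only.

```r
# Hypothetical ages with one extreme value
ages <- c(2, 5, 7, 9, 12, 15, 18, 21, 24, 30, 75)

# Box: 25th, 50th (median), and 75th percentiles
quartiles <- quantile(ages, probs = c(0.25, 0.50, 0.75))
iqr <- quartiles[["75%"]] - quartiles[["25%"]]

# Whiskers may extend at most 1.5 * IQR beyond the box
lower_fence <- quartiles[["25%"]] - 1.5 * iqr
upper_fence <- quartiles[["75%"]] + 1.5 * iqr

# Whisker ends are the most extreme observed values inside the fences
whisker_low  <- min(ages[ages >= lower_fence])
whisker_high <- max(ages[ages <= upper_fence])

# Anything outside the fences is drawn as an individual outlier point
outliers <- ages[ages < lower_fence | ages > upper_fence]

quartiles      # 8, 15, 22.5
whisker_high   # 30
outliers       # 75
```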

When using geom_boxplot() to create a box plot, you generally map only one axis (x or y) within aes(). The axis specified determines if the plots are horizontal or vertical.

In most geoms, you create a plot per group by mapping an aesthetic like color = or fill = to a column within aes(). For box plots, however, you achieve this by assigning the grouping column to the un-assigned axis (x or y). Below is code for a box plot of all age values in the dataset, followed by code to display one box plot for each gender in the dataset. Note that NA (missing) values will appear as a separate box plot unless removed. In this example we also set the fill to the column gender so each plot is a different color - but this is not necessary.

# A) Overall boxplot
ggplot(data = linelist)+  
  geom_boxplot(mapping = aes(y = age))+   # only y axis mapped (not x)
  labs(title = "A) Overall boxplot")

# B) Box plot by group
ggplot(data = linelist, mapping = aes(y = age, x = gender, fill = gender)) + 
  geom_boxplot()+                     
  theme(legend.position = "none")+   # remove legend (redundant)
  labs(title = "B) Boxplot by gender")      

For code to add a box plot to the edges of a scatter plot (“marginal” plots) see the page ggplot tips.

Violin, jitter, and sina plots

Below is code for creating violin plots (geom_violin) and jitter plots (geom_jitter) to show distributions. You can specify that the fill or color is also determined by the data, by inserting these options within aes().

# A) Jitter plot by group
ggplot(data = linelist %>% drop_na(outcome),      # remove missing values
       mapping = aes(y = age,                     # Continuous variable
           x = outcome,                           # Grouping variable
           color = outcome))+                     # Color variable
  geom_jitter()+                                  # Create the jitter plot
  labs(title = "A) jitter plot by outcome")     



# B) Violin plot by group
ggplot(data = linelist %>% drop_na(outcome),       # remove missing values
       mapping = aes(y = age,                      # Continuous variable
           x = outcome,                            # Grouping variable
           fill = outcome))+                       # fill variable (color)
  geom_violin()+                                   # create the violin plot
  labs(title = "B) violin plot by outcome")    

You can combine the two using the geom_sina() function from the ggforce package. The sina plots the jitter points in the shape of the violin plot. When overlaid on the violin plot (adjusting the transparencies) this can be easier to visually interpret.

# C) Violin and sina plot by group
ggplot(
  data = linelist %>% drop_na(outcome), 
  aes(y = age,           # numeric variable
      x = outcome)) +    # group variable
  geom_violin(
    aes(fill = outcome), # fill (color of violin background)
    color = "white",     # white outline
    alpha = 0.2)+        # transparency
  geom_sina(
    size=1,                # Change the size of the jitter
    aes(color = outcome))+ # color (color of dots)
  scale_fill_manual(       # Define fill for violin background by death/recover
    values = c("Death" = "#bf5300", 
              "Recover" = "#11118c")) + 
  scale_color_manual(      # Define colours for points by death/recover
    values = c("Death" = "#bf5300", 
              "Recover" = "#11118c")) + 
  theme_minimal() +                                # Remove the gray background
  theme(legend.position = "none") +                # Remove unnecessary legend
  labs(title = "C) violin and sina plot by outcome, with extra formatting")      

Two continuous variables

Following similar syntax, geom_point() will allow you to plot two continuous variables against each other in a scatter plot. This is useful for showing actual values rather than their distributions. A basic scatter plot of age vs weight is shown in (A). In (B) we again use facet_grid() to show the relationship between two continuous variables in the linelist.

# Basic scatter plot of weight and age
ggplot(data = linelist, 
       mapping = aes(y = wt_kg, x = age))+
  geom_point() +
  labs(title = "A) Scatter plot of weight and age")

# Scatter plot of weight and age by gender and Ebola outcome
ggplot(data = linelist %>% drop_na(gender, outcome), # filter retains non-missing gender/outcome
       mapping = aes(y = wt_kg, x = age))+
  geom_point() +
  labs(title = "B) Scatter plot of weight and age faceted by gender and outcome")+
  facet_grid(gender ~ outcome) 

Three continuous variables

You can display three continuous variables by utilizing the fill = argument to create a heat plot. The color of each “cell” will reflect the value of the third continuous column of data. See the ggplot tips page and the page on Heat plots for more details and several examples.
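
A minimal heat plot sketch follows, using geom_tile() with simulated data. The data frame heat_data and its columns are hypothetical and only serve to show the x, y, and fill mapping.

```r
library(ggplot2)

# Simulated data: one row per combination of week and age group (lower bound)
set.seed(42)
heat_data <- expand.grid(
  week      = 1:8,
  age_group = seq(from = 0, to = 70, by = 10)
)
heat_data$cases <- rpois(nrow(heat_data), lambda = 20)   # third continuous variable

heat_plot <- ggplot(data = heat_data,
                    mapping = aes(x = week, y = age_group, fill = cases)) +
  geom_tile(color = "white") +                            # one tile per row
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(x = "Week", y = "Age group (lower bound)", fill = "Cases")

heat_plot
```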

There are ways to make 3D plots in R, but for applied epidemiology these are often difficult to interpret and therefore less useful for decision-making.

30.13 Plot categorical data

Categorical data can be character values, logical values (TRUE/FALSE), or factors (see the Factors page).

Preparation

Data structure

The first thing to understand about your categorical data is whether it exists as raw observations like a linelist of cases, or as a summary or aggregate data frame that holds counts or proportions. The state of your data will impact which plotting function you use:

  • If your data are raw observations with one row per observation, you will likely use geom_bar()
  • If your data are already aggregated into counts or proportions, you will likely use geom_col()
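
To make the distinction concrete, this sketch draws the same bars twice from a tiny made-up dataset: once from raw rows with geom_bar(), and once from pre-computed counts with geom_col(). The object names are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Raw observations: one row per case
raw_cases <- data.frame(outcome = c(rep("Death", 6), rep("Recover", 4)))

# geom_bar() counts the rows for you, so only x is mapped
bar_plot <- ggplot(data = raw_cases) +
  geom_bar(mapping = aes(x = outcome))

# Aggregated data: one row per outcome, with the count in column n
agg_cases <- raw_cases %>%
  count(outcome)

# geom_col() plots the values as given, so both x and y are mapped
col_plot <- ggplot(data = agg_cases) +
  geom_col(mapping = aes(x = outcome, y = n))
```

Both plots show bars of height 6 and 4. Only the y-axis title differs ("count" versus "n").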

Column class and value ordering

Next, examine the class of the columns you want to plot. We look at hospital, first with class() from base R, and with tabyl() from janitor.

# View class of hospital column - we can see it is a character
class(linelist$hospital)
## [1] "character"
# Look at values and proportions within hospital column
linelist %>% 
  tabyl(hospital)
##                              hospital    n    percent
##                      Central Hospital  454 0.07710598
##                     Military Hospital  896 0.15217391
##                               Missing 1469 0.24949049
##                                 Other  885 0.15030571
##                         Port Hospital 1762 0.29925272
##  St. Mark's Maternity Hospital (SMMH)  422 0.07167120

We can see the values within are characters, as they are hospital names, and by default they are ordered alphabetically. There are ‘other’ and ‘missing’ values, which we would prefer to be the last subcategories when presenting breakdowns. So we change this column into a factor and re-order it. This is covered in more detail in the Factors page.

# Convert to factor and define level order so "Other" and "Missing" are last
linelist <- linelist %>% 
  mutate(
    hospital = fct_relevel(hospital, 
      "St. Mark's Maternity Hospital (SMMH)",
      "Port Hospital", 
      "Central Hospital",
      "Military Hospital",
      "Other",
      "Missing"))
levels(linelist$hospital)
## [1] "St. Mark's Maternity Hospital (SMMH)" "Port Hospital"                        "Central Hospital"                     "Military Hospital"                   
## [5] "Other"                                "Missing"

geom_bar()

Use geom_bar() if you want bar height (or the height of stacked bar components) to reflect the number of relevant rows in the data. These bars will have gaps between them, unless the width = argument is adjusted.

  • Provide only one axis column assignment (typically x-axis). If you provide x and y, you will get Error: stat_count() can only have an x or y aesthetic.
  • You can create stacked bars by adding a fill = column assignment within mapping = aes()
  • The opposite axis will be titled “count” by default, because it represents the number of rows

Below, we have assigned hospital to the y-axis, but it could just as easily be on the x-axis. If you have longer character values, it can sometimes look better to flip the bars sideways and put the legend on the bottom. This may impact how your factor levels are ordered - in this case we reverse them with fct_rev() to put missing and other at the bottom.

# A) Number of cases by hospital
ggplot(linelist %>% drop_na(outcome)) + 
  geom_bar(aes(y = fct_rev(hospital)), width = 0.7) +
  theme_minimal()+
  labs(title = "A) Number of cases by hospital",
       y = "Hospital")


# B) Number of cases by hospital, stacked by outcome
ggplot(linelist %>% drop_na(outcome)) + 
  geom_bar(aes(y = fct_rev(hospital), fill = outcome), width = 0.7) +
  theme_minimal()+
  theme(legend.position = "bottom") +
  labs(title = "B) Number of recovered and dead Ebola cases, by hospital",
       y = "Hospital")

geom_col()

Use geom_col() if you want bar height (or height of stacked bar components) to reflect pre-calculated values that exist in the data. Often, these are summary or “aggregated” counts, or proportions.

Provide column assignments for both axes to geom_col(). Typically your x-axis column is discrete and your y-axis column is numeric.

Let’s say we have this dataset outcomes:

## # A tibble: 2 x 3
##   outcome     n proportion
##   <chr>   <int>      <dbl>
## 1 Death    1022       56.2
## 2 Recover   796       43.8
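
The handbook does not show how outcomes was built. One plausible way, mirroring the count() and mutate() pattern used further down for outcomes2, is sketched here. The data frame demo_linelist is a hypothetical stand-in that reproduces the counts printed above, plus an arbitrary number of missing outcomes.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the linelist: counts match the table above,
# and the 50 missing outcomes are an arbitrary assumption
demo_linelist <- data.frame(
  outcome = c(rep("Death", 1022), rep("Recover", 796), rep(NA, 50))
)

outcomes <- demo_linelist %>%
  drop_na(outcome) %>%                       # remove rows with missing outcome
  count(outcome) %>%                         # one row per outcome, count in n
  mutate(proportion = n / sum(n) * 100)      # percent of all non-missing cases

outcomes
```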

Below is code using geom_col for creating simple bar charts to show the distribution of Ebola patient outcomes. With geom_col, both x and y need to be specified. Here x is the categorical variable along the x axis, and y is the generated proportions column proportion.

# Outcomes in all cases
ggplot(outcomes) + 
  geom_col(aes(x=outcome, y = proportion)) +
  labs(subtitle = "Percent of recovered and dead Ebola cases")

To show breakdowns by hospital, we would need our table to contain more information, and to be in “long” format. We create this table with the frequencies of the combined categories outcome and hospital (see Grouping data page for grouping tips).

outcomes2 <- linelist %>% 
  drop_na(outcome) %>% 
  count(hospital, outcome) %>%  # get counts by hospital and outcome
  group_by(hospital) %>%        # Group so proportions are out of hospital total
  mutate(proportion = n/sum(n)*100) # calculate proportions of hospital total

head(outcomes2) # Preview data
## # A tibble: 6 x 4
## # Groups:   hospital [3]
##   hospital                             outcome     n proportion
##   <fct>                                <chr>   <int>      <dbl>
## 1 St. Mark's Maternity Hospital (SMMH) Death     199       61.2
## 2 St. Mark's Maternity Hospital (SMMH) Recover   126       38.8
## 3 Port Hospital                        Death     785       57.6
## 4 Port Hospital                        Recover   579       42.4
## 5 Central Hospital                     Death     193       53.9
## 6 Central Hospital                     Recover   165       46.1

We then create the ggplot with some added formatting:

  • Horizontal bars: We map proportion to x and hospital to y so that the hospital names are easy to read. Adding coord_flip() to a vertical bar plot would achieve the same result.
  • Stacked columns: With fill = outcome the bars for death and recover are stacked, which is the default. Add position = "dodge" to geom_col() to present them side by side instead.
  • Column width: We specify width = 0.5, so the columns are half the full possible width.
  • Column order: We reverse the order of the hospital levels with fct_rev() so that ‘Other’ and ‘Missing’ are at the bottom. This is needed because ggplot2 places the first factor level at the bottom of a vertical axis.
  • Other details: Labels/titles and colours are added within labs() and scale_fill_manual() respectively.

# Outcomes in all cases by hospital
ggplot(outcomes2) +  
  geom_col(
    mapping = aes(
      x = proportion,                 # show pre-calculated proportion values
      y = fct_rev(hospital),          # reverse level order so missing/other at bottom
      fill = outcome),                # stacked by outcome
    width = 0.5)+                    # thinner bars (out of 1)
  theme_minimal() +                  # Minimal theme 
  theme(legend.position = "bottom")+
  labs(subtitle = "Percent of recovered and dead Ebola cases, by hospital",
       fill = "Outcome",                   # legend title
       x = "Percent of hospital's cases",  # x axis title (proportion is mapped to x)
       y = "Hospital of admission")+       # y axis title (hospital is mapped to y)
  scale_fill_manual(                 # adding colors manually
    values = c("Death"= "#3B1c8C",
               "Recover" = "#21908D" )) 

Note that the proportions are binary, so we may prefer to drop ‘recover’ and just show the proportion who died. This is just for illustration purposes.
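
If you prefer the bars for each outcome side by side rather than stacked, position = "dodge" can be supplied to geom_col(). The sketch below uses a small hypothetical data frame, outcomes2_demo, holding four of the rows previewed above.

```r
library(ggplot2)

# Four rows copied from the outcomes2 preview above (hypothetical object name)
outcomes2_demo <- data.frame(
  hospital   = rep(c("Port Hospital", "Central Hospital"), each = 2),
  outcome    = rep(c("Death", "Recover"), times = 2),
  proportion = c(57.6, 42.4, 53.9, 46.1)
)

dodge_plot <- ggplot(data = outcomes2_demo) +
  geom_col(
    mapping = aes(x = proportion, y = hospital, fill = outcome),
    position = "dodge",                # side by side instead of stacked
    width = 0.5) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Percent of hospital's cases", y = "Hospital of admission", fill = "Outcome")

dodge_plot
```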

If using geom_col() with date data (e.g. an epicurve from aggregated data), adjust the width = argument to remove the gaps between the bars. For daily data set width = 1; for weekly data, width = 7. A fixed width does not work for monthly data because each month has a different number of days.
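
As a sketch of the weekly case, the code below plots made-up weekly counts with width = 7 so that each bar spans its full week. The data frame and its values are hypothetical.

```r
library(ggplot2)

# Made-up weekly case counts; week_start is the first day of each week
weekly_counts <- data.frame(
  week_start = seq(from = as.Date("2014-06-02"), by = "week", length.out = 8),
  n          = c(3, 7, 12, 20, 16, 9, 5, 2)
)

epicurve <- ggplot(data = weekly_counts) +
  geom_col(
    mapping = aes(x = week_start, y = n),
    width = 7,                          # 7 days wide, so neighbouring bars touch
    color = "white") +                  # thin outline to separate the bars
  scale_x_date(date_breaks = "2 weeks", date_labels = "%d %b") +
  labs(x = "Week of onset", y = "Cases")

epicurve
```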

geom_histogram()

Histograms may look like bar charts, but are distinct because they measure the distribution of a continuous variable. There are no spaces between the “bars”, and only one column is provided to geom_histogram(). There are arguments specific to histograms such as binwidth = and breaks = to specify how the data should be binned. The section above on continuous data and the page on Epidemic curves provide additional detail.
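
As an illustration of breaks =, the sketch below bins a handful of made-up ages into explicit 10-year groups. The data are hypothetical and chosen so that no value falls exactly on a break.

```r
library(ggplot2)

# Made-up ages (not the handbook linelist)
ages <- data.frame(
  age = c(1, 4, 9, 12, 15, 18, 22, 27, 33, 38, 41, 47, 52, 58, 64)
)

breaks_plot <- ggplot(data = ages, mapping = aes(x = age)) +
  geom_histogram(
    breaks = seq(from = 0, to = 70, by = 10),   # explicit bin edges: 0, 10, ..., 70
    color = "white") +
  labs(title = "Histogram with 10-year bins set by breaks =")

breaks_plot
```

Supplying breaks = gives you full control over the bin edges, whereas binwidth = only fixes the width and lets ggplot2 choose where the bins start.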

30.14 Resources

There is a huge amount of help online, especially with ggplot. See:

31 ggplot tips

In this page we will cover tips and tricks to make your ggplots sharp and fancy. See the page on ggplot basics for the fundamentals.

There are several extensive ggplot2 tutorials linked in the Resources section. You can also download this data visualization with ggplot cheatsheet from the RStudio website. We strongly recommend that you browse the R graph gallery and Data-to-viz for inspiration.

31.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  tidyverse,      # includes ggplot2 and other
  rio,            # import/export
  here,           # file locator
  stringr,        # working with characters   
  scales,         # transform numbers
  ggrepel,        # smartly-placed labels
  gghighlight,    # highlight one part of plot
  RColorBrewer    # color scales
)

Import data

For this page, we import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

linelist <- rio::import("linelist_cleaned.rds")

31.2 Scales for color, fill, axes, etc.

In ggplot2, when aesthetics of plotted data (e.g. size, color, shape, fill, plot axis) are mapped to columns in the data, the exact display can be adjusted with the corresponding “scale” command. In this section we explain some common scale adjustments.

31.2.1 Color schemes

One thing that can initially be difficult to understand with ggplot2 is control of color schemes. Note that this section discusses the color of plot objects (geoms/shapes) such as points, bars, lines, tiles, etc. To adjust color of accessory text, titles, or background color see the Themes section of the ggplot basics page.

To control “color” of plot objects you will be adjusting either the color = aesthetic (the exterior color) or the fill = aesthetic (the interior color). One exception to this pattern is geom_point(), where you really only get to control color =, which controls the color of the point (interior and exterior).

When setting colour or fill you can use colour names recognized by R like "red" (see complete list or enter ?colors), or a specific hex colour such as "#ff0505".

# histogram with static outline and fill colors
ggplot(data = linelist, mapping = aes(x = age))+       # set data and axes
  geom_histogram(              # display histogram
    binwidth = 7,                # width of bins
    color = "red",               # bin line color
    fill = "lightblue")          # bin interior color (fill) 

As explained in the ggplot basics section on mapping data to the plot, aesthetics such as fill = and color = can be defined either outside of a mapping = aes() statement or inside of one. If outside the aes(), the assigned value should be static (e.g. color = "blue") and will apply for all data plotted by the geom. If inside, the aesthetic should be mapped to a column, like color = hospital, and the expression will vary by the value for that row in the data. A few examples:

# Static color for points and for line
ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+     
  geom_point(color = "purple")+
  geom_vline(xintercept = 50, color = "orange")+
  labs(title = "Static color for points and line")

# Color mapped to continuous column
ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+     
  geom_point(mapping = aes(color = temp))+         
  labs(title = "Color mapped to continuous column")

# Color mapped to discrete column
ggplot(data = linelist, mapping = aes(x = age, y = wt_kg))+     
  geom_point(mapping = aes(color = gender))+         
  labs(title = "Color mapped to discrete column")

# bar plot, fill to discrete column, color to static value
ggplot(data = linelist, mapping = aes(x = hospital))+     
  geom_bar(mapping = aes(fill = gender), color = "yellow")+         
  labs(title = "Fill mapped to discrete column, static color")

Scales

Once you map a column to a plot aesthetic (e.g. x =, y =, fill =, color =…), your plot will gain a scale/legend. As seen above, the scale can be continuous, discrete, date, etc., depending on the class of the assigned column. If you have multiple aesthetics mapped to columns, your plot will have multiple scales.

You can control the scales with the appropriate scale_() function. The scale functions of ggplot2 have 3 parts that are written like this: scale_AESTHETIC_METHOD().

  1. The first part, scale_(), is fixed.
  2. The second part, the AESTHETIC, should be the aesthetic that you want to adjust the scale for (_fill_, _shape_, _color_, _size_, _alpha_…) - the options here also include _x_ and _y_.
  3. The third part, the METHOD, will be either _discrete(), _continuous(), _date(), _gradient(), or _manual() depending on the class of the column and how you want to control it. There are others, but these are the most-often used.

Be sure that you use the correct function for the scale! Otherwise your scale command will not appear to change anything. If you have multiple scales, you may use multiple scale functions to adjust them! For example:
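
As a hedged illustration, the sketch below adjusts three scales in one plot: two continuous axes and a continuous color gradient. The data are simulated, and the column names are assumptions chosen to resemble the linelist.

```r
library(ggplot2)

# Simulated stand-in for the linelist
set.seed(7)
demo <- data.frame(
  age   = runif(100, min = 0, max = 80),
  wt_kg = rnorm(100, mean = 55, sd = 12),
  temp  = rnorm(100, mean = 38, sd = 1)
)

scaled_plot <- ggplot(data = demo, mapping = aes(x = age, y = wt_kg, color = temp)) +
  geom_point() +
  scale_x_continuous(breaks = seq(from = 0, to = 80, by = 20)) +  # x is continuous
  scale_y_continuous(name = "Weight (kg)") +                      # y is continuous
  scale_color_gradient(                                           # color is continuous
    low = "blue",
    high = "red",
    name = "Temperature")

scaled_plot
```

Had color been mapped to a discrete column instead, scale_color_gradient() would not apply, and a discrete or manual scale function would be needed.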

Scale arguments

Each kind of scale has its own arguments, though there is some overlap. Query the function like ?scale_color_discrete in the R console to see the function argument documentation.

For continuous scales, use breaks = to provide a sequence of values with seq() (which takes to =, from =, and by =, as shown in the example below). Set expand = c(0,0) to eliminate padding space around the axes (this can be used on any _x_ or _y_ scale).

For discrete scales, you can adjust the order of level appearance with breaks =, and how the values display with the labels = argument. Provide a character vector to each of those (see example below). You can also drop NA easily by setting na.translate = FALSE.

The nuances of date scales are covered more extensively in the Epidemic curves page.

Manual adjustments

One of the most useful tricks is using “manual” scaling functions to explicitly assign colors as you desire. These are functions with the syntax scale_xxx_manual() (e.g. scale_colour_manual() or scale_fill_manual()). Each of the arguments listed here is demonstrated in the code example below.

  • Assign colors to data values with the values = argument
  • Specify a color for NA with na.value =
  • Change how the values are written in the legend with the labels = argument
  • Change the legend title with name =

Below, we create a bar plot and show how it appears by default, and then with three scales adjusted - the continuous y-axis scale, the discrete x-axis scale, and manual adjustment of the fill (interior bar color).

# BASELINE - no scale adjustment
ggplot(data = linelist)+
  geom_bar(mapping = aes(x = outcome, fill = gender))+
  labs(title = "Baseline - no scale adjustments")

# SCALES ADJUSTED
ggplot(data = linelist)+
  
  geom_bar(mapping = aes(x = outcome, fill = gender), color = "black")+
  
  theme_minimal()+                   # simplify background
  
  scale_y_continuous(                # continuous scale for y-axis (counts)
    expand = c(0,0),                 # no padding
    breaks = seq(from = 0,
                 to = 3000,
                 by = 500))+
  
  scale_x_discrete(                   # discrete scale for x-axis (outcome)
    expand = c(0,0),                  # no padding
    drop = FALSE,                     # show all factor levels (even if not in data)
    na.translate = FALSE,             # remove NA outcomes from plot
    labels = c("Died", "Recovered"))+ # Change display of values
    
  
  scale_fill_manual(                  # Manually specify fill (bar interior color)
    values = c("m" = "violetred",     # reference values in data to assign colors
               "f" = "aquamarine"),
    labels = c("m" = "Male",          # re-label the legend (use "=" assignment to avoid mistakes)
              "f" = "Female",
              "Missing"),
    name = "Gender",                  # title of legend
    na.value = "grey"                 # assign a color for missing values
  )+
  labs(title = "Adjustment of scales") # Set the plot title

Continuous axes scales

When data are mapped to the plot axes, these too can be adjusted with scales commands. A common example is adjusting the display of an axis (e.g. y-axis) that is mapped to a column with continuous data.

We may want to adjust the breaks or display of the values in the ggplot using scale_y_continuous(). As noted above, use the argument breaks = to provide a sequence of values that will serve as “breaks” along the scale. These are the values at which numbers will display. To this argument, you can provide a c() vector containing the desired break values, or you can provide a regular sequence of numbers using the base R function seq(). This seq() function accepts to =, from =, and by =.

# BASELINE - no scale adjustment
ggplot(data = linelist)+
  geom_bar(mapping = aes(x = outcome, fill = gender))+
  labs(title = "Baseline - no scale adjustments")

# ADJUSTED - y-axis breaks every 100
ggplot(data = linelist)+
  geom_bar(mapping = aes(x = outcome, fill = gender))+
  scale_y_continuous(
    breaks = seq(
      from = 0,
      to = 3000,
      by = 100)
  )+
  labs(title = "Adjusted y-axis breaks")
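
To see what breaks = actually receives, it can help to run seq() on its own before putting it inside a scale. The following is a minimal base R sketch; the object names breaks_regular and breaks_custom are ours, chosen only for illustration.

```r
# seq() returns an ordinary numeric vector of evenly spaced values
breaks_regular <- seq(from = 0, to = 3000, by = 500)
breaks_regular

# c() lets you list irregular break values by hand
breaks_custom <- c(0, 100, 500, 1000, 3000)
breaks_custom

# either vector could then be supplied to breaks = in scale_y_continuous()
```

Evenly spaced breaks are usually easiest to read; a hand-written c() vector is mainly useful when you want to mark specific thresholds.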

Display percents

If your original data values are proportions, you can easily display them as percents with “%” by providing labels = scales::percent in your scales command, as shown below.

While an alternative would be to convert the values to character and add a “%” character to the end, this approach will cause complications because your data will no longer be continuous numeric values.

# Original y-axis proportions
#############################
linelist %>%                                   # start with linelist
  group_by(hospital) %>%                       # group data by hospital
  summarise(                                   # create summary columns
    n = n(),                                     # total number of rows in group
    deaths = sum(outcome == "Death", na.rm=T),   # number of deaths in group
    prop_death = deaths/n) %>%                   # proportion deaths in group
  ggplot(                                      # begin plotting
    mapping = aes(
      x = hospital,
      y = prop_death))+ 
  geom_col()+
  theme_minimal()+
  labs(title = "Display y-axis original proportions")



# Display y-axis proportions as percents
########################################
linelist %>%         
  group_by(hospital) %>% 
  summarise(
    n = n(),
    deaths = sum(outcome == "Death", na.rm=T),
    prop_death = deaths/n) %>% 
  ggplot(
    mapping = aes(
      x = hospital,
      y = prop_death))+
  geom_col()+
  theme_minimal()+
  labs(title = "Display y-axis as percents (%)")+
  scale_y_continuous(
    labels = scales::percent                    # display proportions as percents
  )
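
If you want to check how proportions will be labelled before plotting, you can call the labelling function directly on a numeric vector. This sketch assumes the scales package is installed and uses label_percent(), which builds a labelling function in the same family as scales::percent; the vector props is a made-up example.

```r
library(scales)

props <- c(0.1, 0.25, 0.5)            # made-up numeric proportions

# label_percent() returns a function; accuracy = 1 rounds to whole percents
pct_labels <- label_percent(accuracy = 1)(props)
pct_labels

# the underlying values are untouched and remain numeric
is.numeric(props)
```

Because only the labels change, the axis stays continuous and the bars keep their correct heights.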

Log scale

To transform a continuous axis to log scale, add trans = "log2" to the scale command. For purposes of example, we create a data frame of regions with their respective preparedness_index and cumulative cases values.

plot_data <- data.frame(
  region = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
  preparedness_index = c(8.8, 7.5, 3.4, 3.6, 2.1, 7.9, 7.0, 5.6, 1.0),
  cases_cumulative = c(15, 45, 80, 20, 21, 7, 51, 30, 1442)
)

plot_data
##   region preparedness_index cases_cumulative
## 1      A                8.8               15
## 2      B                7.5               45
## 3      C                3.4               80
## 4      D                3.6               20
## 5      E                2.1               21
## 6      F                7.9                7
## 7      G                7.0               51
## 8      H                5.6               30
## 9      I                1.0             1442

The cumulative cases for region “I” are dramatically greater than all the other regions. In circumstances like this, you may elect to display the y-axis using a log scale so the reader can see differences between the regions with fewer cumulative cases.

# Original y-axis
preparedness_plot <- ggplot(data = plot_data,  
       mapping = aes(
         x = preparedness_index,
         y = cases_cumulative))+
  geom_point(size = 2)+            # points for each region 
  geom_text(
    mapping = aes(label = region),
    vjust = 1.5)+                  # add text labels
  theme_minimal()

preparedness_plot                  # print original plot


# print with y-axis transformed
preparedness_plot+                   # begin with plot saved above
  scale_y_continuous(trans = "log2") # add transformation for y-axis
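
As a possible alternative, ggplot2 also provides the shortcut scale_y_log10() for a base-10 log axis. The sketch below rebuilds the example data so it can run on its own; the break values passed to breaks = are our own choice, not something prescribed above.

```r
library(ggplot2)

plot_data <- data.frame(
  preparedness_index = c(8.8, 7.5, 3.4, 3.6, 2.1, 7.9, 7.0, 5.6, 1.0),
  cases_cumulative   = c(15, 45, 80, 20, 21, 7, 51, 30, 1442)
)

log10_plot <- ggplot(
    data = plot_data,
    mapping = aes(x = preparedness_index, y = cases_cumulative))+
  geom_point(size = 2)+
  scale_y_log10(breaks = c(1, 10, 100, 1000))+   # base-10 log axis with chosen breaks
  theme_minimal()

log10_plot
```

Whichever base you use, remember that a log axis cannot display zero or negative values.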

Gradient scales

Fill gradient scales can involve additional nuance. The defaults are usually quite pleasing, but you may want to adjust the values, cutoffs, etc.

To demonstrate how to adjust a continuous color scale, we’ll use a data set from the Contact tracing page that contains the ages of cases and of their source cases.

case_source_relationships <- rio::import(here::here("data", "godata", "relationships_clean.rds")) %>% 
  select(source_age, target_age) 

Below, we produce a “raster” heat tile density plot. We won’t elaborate how (see the link in paragraph above) but we will focus on how we can adjust the color scale. Read more about the stat_density2d() ggplot2 function here. Note how the fill scale is continuous.

trans_matrix <- ggplot(
    data = case_source_relationships,
    mapping = aes(x = source_age, y = target_age))+
  stat_density2d(
    geom = "raster",
    mapping = aes(fill = after_stat(density)),
    contour = FALSE)+
  theme_minimal()

Now we show some variations on the fill scale:

trans_matrix
trans_matrix + scale_fill_viridis_c(option = "plasma")

Now we show some examples of actually adjusting the break points of the scale:

  • scale_fill_gradient() accepts two colors (high/low)
  • scale_fill_gradientn() accepts a vector of colors of any length to colors = (intermediate values will be interpolated)
  • Use scales::rescale() to adjust how colors are positioned along the gradient; it rescales your vector of positions to be between 0 and 1.

# 2 colors to scale
trans_matrix + 
  scale_fill_gradient(     # 2-sided gradient scale
    low = "aquamarine",    # low value
    high = "purple",       # high value
    na.value = "grey",     # value for NA
    name = "Density")+     # Legend title
  labs(title = "Manually specify high/low colors")

# 3+ colors to scale
trans_matrix + 
  scale_fill_gradientn(    # 3-color scale (low/mid/high)
    colors = c("blue", "yellow","red") # provide colors in vector
  )+
  labs(title = "3-color scale")

# Use of rescale() to adjust placement of colors along scale
trans_matrix + 
  scale_fill_gradientn(    # provide any number of colors
    colors = c("blue", "yellow","red", "black"),
    values = scales::rescale(c(0, 0.05, 0.07, 0.10, 0.15, 0.20, 0.3, 0.5)) # positions for colors are rescaled between 0 and 1
    )+
  labs(title = "Colors not evenly positioned")

# use of limits to cut-off values that get fill color
trans_matrix + 
  scale_fill_gradientn(    
    colors = c("blue", "yellow","red"),
    limits = c(0, 0.0002))+
  labs(title = "Restrict value limits, resulting in grey space")
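
To make the role of scales::rescale() more concrete, here is a small sketch of what it returns when run on its own. The positions vector holds hypothetical density cut-offs invented for this illustration.

```r
library(scales)

positions <- c(0, 0.05, 0.10, 0.5)   # hypothetical cut-offs, in data units
rescaled  <- rescale(positions)      # linearly mapped so min = 0 and max = 1
rescaled
```

The smallest position becomes 0, the largest becomes 1, and the others keep their relative spacing, which is the form that the values = argument of scale_fill_gradientn() expects.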

Palettes

Colorbrewer and Viridis

More generally, if you want predefined palettes, you can use the scale_xxx_brewer or scale_xxx_viridis_y functions.

The ‘brewer’ functions can draw from colorbrewer.org palettes.

The ‘viridis’ functions draw from viridis (colourblind friendly!) palettes, which “provide colour maps that are perceptually uniform in both colour and black-and-white. They are also designed to be perceived by viewers with common forms of colour blindness.” (read more here and here). Define if the palette is discrete, continuous, or binned by specifying this at the end of the function (e.g. discrete is scale_xxx_viridis_d).

It is advised that you test your plot in this color blindness simulator. If you have a red/green color scheme, try a “hot-cold” (red-blue) scheme instead, as described here.

Here is an example from the ggplot basics page, using various color schemes.

symp_plot <- linelist %>%                                         # begin with linelist
  select(c(case_id, fever, chills, cough, aches, vomit)) %>%     # select columns
  pivot_longer(                                                  # pivot longer
    cols = -case_id,                                  
    names_to = "symptom_name",
    values_to = "symptom_is_present") %>%
  mutate(                                                        # replace missing values
    symptom_is_present = replace_na(symptom_is_present, "unknown")) %>% 
  ggplot(                                                        # begin ggplot!
    mapping = aes(x = symptom_name, fill = symptom_is_present))+
  geom_bar(position = "fill", col = "black") +                    
  theme_classic() +
  theme(legend.position = "bottom")+
  labs(
    x = "Symptom",
    y = "Symptom status (proportion)"
  )

symp_plot  # print with default colors

#################################
# print with manually-specified colors
symp_plot +
  scale_fill_manual(
    values = c("yes" = "black",         # explicitly define colours
               "no" = "white",
               "unknown" = "grey"),
    breaks = c("yes", "no", "unknown"), # order the factors correctly
    name = ""                           # set legend to no title

  ) 

#################################
# print with viridis discrete colors
symp_plot +
  scale_fill_viridis_d(
    breaks = c("yes", "no", "unknown"),
    name = ""
  )
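
The examples above use manual and viridis scales; a ‘brewer’ scale can be applied in the same way. The following is a self-contained sketch on a small made-up data frame (toy), using the ColorBrewer palette named "Set2" as one arbitrary choice among many.

```r
library(ggplot2)

# made-up counts of symptom status, for illustration only
toy <- data.frame(
  symptom = rep(c("cough", "fever"), each = 3),
  status  = rep(c("yes", "no", "unknown"), times = 2),
  n       = c(30, 60, 10, 55, 40, 5)
)

brewer_plot <- ggplot(
    data = toy,
    mapping = aes(x = symptom, y = n, fill = status))+
  geom_col(position = "fill", color = "black")+
  scale_fill_brewer(palette = "Set2", name = "")+   # ColorBrewer qualitative palette
  theme_classic()

brewer_plot
```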

31.3 Change order of discrete variables

Changing the order in which discrete variables appear is often confusing for people who are new to ggplot2. It becomes straightforward once you understand how ggplot2 handles discrete variables under the hood. Generally speaking, if a discrete variable is used, it is automatically converted to a factor, and the factor levels are placed in alphanumeric order by default. To change this, reorder the factor levels to reflect the order in which you would like them to appear in the chart. For more detailed information on how to reorder factor objects, see the factor section of the guide.

We can look at a common example using age groups - by default the 5-9 age group will be placed in the middle of the age groups (given alphanumeric order), but we can move it behind the 0-4 age group of the chart by releveling the factors.

ggplot(
  data = linelist %>% drop_na(age_cat5),                         # remove rows where age_cat5 is missing
  mapping = aes(x = fct_relevel(age_cat5, "5-9", after = 1))) +  # relevel factor

  geom_bar() +
  
  labs(x = "Age group", y = "Number of hospitalisations",
       title = "Total hospitalisations by age group") +
  
  theme_minimal()
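
To see the effect of fct_relevel() outside of a plot, you can inspect the factor levels directly. This sketch assumes the forcats package is installed and uses a small made-up vector of age groups rather than the linelist.

```r
library(forcats)

age_cat <- factor(c("0-4", "5-9", "10-14", "15-19"))
levels(age_cat)                       # default alphanumeric order: "5-9" is out of place

age_cat_fixed <- fct_relevel(age_cat, "5-9", after = 1)   # move "5-9" to directly after the first level
levels(age_cat_fixed)
```

The plot axis follows the order returned by levels(), so checking it this way is a quick test before plotting.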

31.3.0.1 ggthemr

Also consider using the ggthemr package. You can download this package from Github using the instructions here. It offers palettes that are very aesthetically pleasing, but be aware that these typically have a maximum number of values that can be limiting if you want more than 7 or 8 colors.

31.4 Contour lines

Contour plots are helpful when you have many points that might cover each other (“overplotting”). The case-source data used above are again plotted, but more simply using stat_density2d() and stat_density2d_filled() to produce discrete contour levels - like a topographical map. Read more about the statistics here.

case_source_relationships %>% 
  ggplot(aes(x = source_age, y = target_age))+
  stat_density2d()+
  geom_point()+
  theme_minimal()+
  labs(title = "stat_density2d() + geom_point()")


case_source_relationships %>% 
  ggplot(aes(x = source_age, y = target_age))+
  stat_density2d_filled()+
  theme_minimal()+
  labs(title = "stat_density2d_filled()")

31.5 Marginal distributions

To show the distributions on the edges of a geom_point() scatterplot, you can use the ggExtra package and its function ggMarginal(). Save your original ggplot as an object, then pass it to ggMarginal() as shown below. Here are the key arguments:

  • You must specify the type = as either “histogram”, “density”, “boxplot”, “violin”, or “densigram”.
  • By default, marginal plots will appear for both axes. You can set margins = to “x” or “y” if you only want one.
  • Other optional arguments include fill = (bar color), color = (line color), size = (plot size relative to margin size, so larger number makes the marginal plot smaller).
  • You can provide other axis-specific arguments to xparams = and yparams =. For example, to have different histogram bin sizes, as shown below.

You can have the marginal plots reflect groups (columns that have been assigned to color = in your ggplot() mapped aesthetics). If this is the case, set the ggMarginal() argument groupColour = or groupFill = to TRUE, as shown below.

Read more at this vignette, in the R Graph Gallery or the function R documentation ?ggMarginal.

# Install/load ggExtra
pacman::p_load(ggExtra)

# Basic scatter plot of weight and age
scatter_plot <- ggplot(data = linelist)+
  geom_point(mapping = aes(y = wt_kg, x = age)) +
  labs(title = "Scatter plot of weight and age")

To add marginal histograms use type = "histogram". You can optionally set groupFill = TRUE to get stacked histograms.

# with histograms
ggMarginal(
  scatter_plot,                     # add marginal histograms
  type = "histogram",               # specify histograms
  fill = "lightblue",               # bar fill
  xparams = list(binwidth = 10),    # other parameters for x-axis marginal
  yparams = list(binwidth = 5))     # other parameters for y-axis marginal

Marginal density plot with grouped/colored values:

# Scatter plot, colored by gender
# Gender column is assigned as color in ggplot. groupFill in ggMarginal set to TRUE
scatter_plot_color <- ggplot(data = linelist %>% drop_na(gender))+
  geom_point(mapping = aes(y = wt_kg, x = age, color = gender)) +
  labs(title = "Scatter plot of weight and age")+
  theme(legend.position = "bottom")

ggMarginal(scatter_plot_color, type = "density", groupFill = TRUE)

Set the size = argument to adjust the relative size of the marginal plot. A smaller number makes a larger marginal plot. You can also set color =. Below is a marginal boxplot, demonstrating the margins = argument so that it appears on only one axis:

# with boxplot 
ggMarginal(
  scatter_plot,
  margins = "x",      # only show x-axis marginal plot
  type = "boxplot")   

31.6 Smart Labeling

In ggplot2, it is also possible to add text to plots. However, text labels often clash with data points in a plot, making them look messy or hard to read. There is no ideal way to deal with this in the base package, but there is a ggplot2 add-on, known as ggrepel, that makes dealing with this very simple!

The ggrepel package provides two new functions, geom_label_repel() and geom_text_repel(), which replace geom_label() and geom_text(). Simply use these functions instead of the base functions to produce neat labels. Within the function, map the aesthetics aes() as always, but include the argument label = to which you provide a column name containing the values you want to display (e.g. patient id, or name, etc.). You can make more complex labels by combining columns and newlines (\n) within str_glue() as shown below.

A few tips:

  • Use min.segment.length = 0 to always draw line segments, or min.segment.length = Inf to never draw them
  • Use size = outside of aes() to set text size
  • Use force = to change the degree of repulsion between labels and their respective points (default is 1)
  • Include fill = within aes() to have label colored by value
    • A letter “a” may appear in the legend - add guides(fill = guide_legend(override.aes = aes(color = NA)))+ to remove it

See this very in-depth tutorial for more.

pacman::p_load(ggrepel)

linelist %>%                                               # start with linelist
  group_by(hospital) %>%                                   # group by hospital
  summarise(                                               # create new dataset with summary values per hospital
    n_cases = n(),                                           # number of cases per hospital
    delay_mean = round(mean(days_onset_hosp, na.rm=T),1),    # mean delay per hospital
  ) %>% 
  ggplot(mapping = aes(x = n_cases, y = delay_mean))+      # send data frame to ggplot
  geom_point(size = 2)+                                    # add points
  geom_label_repel(                                        # add point labels
    mapping = aes(
      label = stringr::str_glue(
        "{hospital}\n{n_cases} cases, {delay_mean} days")  # how label displays
      ), 
    size = 3,                                              # text size in labels
    min.segment.length = 0)+                               # show all line segments                
  labs(                                                    # add axes labels
    title = "Mean delay to admission, by hospital",
    x = "Number of cases",
    y = "Mean delay (days)")
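
It can be useful to preview the label text before plotting it. The sketch below builds the same style of label with str_glue_data() from stringr on a small data frame of hypothetical summary values (the hospital names and numbers are invented for illustration).

```r
library(stringr)

# hypothetical per-hospital summary values
hosp_summary <- data.frame(
  hospital   = c("Central Hospital", "Port Hospital"),
  n_cases    = c(454, 1762),
  delay_mean = c(1.9, 2.1)
)

# columns are referenced inside {} and \n starts a new line within the label
label_text <- str_glue_data(
  hosp_summary,
  "{hospital}\n{n_cases} cases, {delay_mean} days")

label_text
```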

You can label only a subset of the data points - by using standard ggplot() syntax to provide different data = for each geom layer of the plot. Below, all cases are plotted, but only a few are labeled.

ggplot()+
  # All points in grey
  geom_point(
    data = linelist,                                   # all data provided to this layer
    mapping = aes(x = ht_cm, y = wt_kg),
    color = "grey",
    alpha = 0.5)+                                              # grey and semi-transparent
  
  # Few points in black
  geom_point(
    data = linelist %>% filter(days_onset_hosp > 15),  # filtered data provided to this layer
    mapping = aes(x = ht_cm, y = wt_kg),
    alpha = 1)+                                                # default black and not transparent
  
  # point labels for few points
  geom_label_repel(
    data = linelist %>% filter(days_onset_hosp > 15),  # filter the data for the labels
    mapping = aes(
      x = ht_cm,
      y = wt_kg,
      fill = outcome,                                          # label color by outcome
      label = stringr::str_glue("Delay: {days_onset_hosp}d")), # label created with str_glue()
    min.segment.length = 0) +                                  # show line segments for all
  
  # remove letter "a" from inside legend boxes
  guides(fill = guide_legend(override.aes = aes(color = NA)))+
  
  # axis labels
  labs(
    title = "Cases with long delay to admission",
    y = "weight (kg)",
    x = "height (cm)")

31.7 Time axes

Working with time axes in ggplot can seem daunting, but is made very easy with a few key functions. Remember that when working with times or dates you should ensure that the relevant variables are formatted as date or datetime class - see the Working with dates page for more information on this, or Epidemic curves page (ggplot section) for examples.

The single most useful set of functions for working with dates in ggplot2 is the scale functions (scale_x_date(), scale_x_datetime(), and their cognate y-axis functions). These functions let you define how often you have axis labels, and how to format axis labels. To find out how to format dates, see the working with dates section again! You can use the date_breaks and date_labels arguments to specify how dates should look:

  1. date_breaks allows you to specify how often axis breaks occur - you can pass a string here (e.g. "3 months", or "2 days")

  2. date_labels allows you to define the format dates are shown in. You can pass a date format string to this argument (e.g. "%b-%d-%Y"):

# make epi curve by date of onset when available
ggplot(linelist, aes(x = date_onset)) +
  geom_histogram(binwidth = 7) +
  scale_x_date(
    # 1 break every 1 month
    date_breaks = "1 months",
    # labels should show month then date
    date_labels = "%b %d"
  ) +
  theme_classic()
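
The format strings given to date_labels = follow the same strptime-style codes as base R's format(), so you can try them out on a single date first. A minimal sketch, using an arbitrary example date:

```r
example_date <- as.Date("2014-05-09")   # arbitrary date for illustration

format(example_date, "%d/%m/%Y")   # day/month/year, all numeric
format(example_date, "%Y-%m")      # year and month only
format(example_date, "%b %d")      # abbreviated month name then day (month name depends on your locale)
```

Once a format string looks right here, the same string can be passed to date_labels =.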

31.8 Highlighting

Highlighting specific elements in a chart is a useful way to draw attention to a specific instance of a variable while also providing information on the dispersion of the full dataset. While this is not easily done in base ggplot2, there is an external package that can help to do this known as gghighlight. This is easy to use within the ggplot syntax.

The gghighlight package uses the gghighlight() function to achieve this effect. To use this function, supply a logical statement to the function - this can have quite flexible outcomes, but here we’ll show an example of the age distribution of cases in our linelist, highlighting them by outcome.

# load gghighlight
library(gghighlight)

# replace NA values with unknown in the outcome variable
linelist <- linelist %>%
  mutate(outcome = replace_na(outcome, "Unknown"))

# produce a histogram of all cases by age
ggplot(
  data = linelist,
  mapping = aes(x = age_years, fill = outcome)) +
  geom_histogram() + 
  gghighlight::gghighlight(outcome == "Death")     # highlight instances where the patient has died.

This also works well with faceting functions - each facet shows its own data highlighted, while the data from the other facets remain visible in the background for comparison! Below we count cases by week and plot the epidemic curves by hospital (color = and facet_wrap() set to hospital column).

# plot weekly case counts as lines, by hospital
linelist %>% 
  count(week = lubridate::floor_date(date_hospitalisation, "week"),
        hospital) %>% 
  ggplot()+
  geom_line(aes(x = week, y = n, color = hospital))+
  theme_minimal()+
  gghighlight::gghighlight() +                      # highlight each hospital's line within its own facet
  facet_wrap(~hospital)                              # make facets by hospital

31.9 Plotting multiple datasets

Note that properly aligning axes to plot from multiple datasets in the same plot can be difficult. Consider one of the following strategies:

  • Merge the data prior to plotting, and convert to “long” format with a column reflecting the dataset
  • Use cowplot or a similar package to combine two plots (see below)
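
The first strategy can be sketched as follows, assuming the dplyr and ggplot2 packages are installed. The two data frames site_a and site_b are hypothetical weekly counts invented for this illustration; bind_rows() stacks them and records the source in a new column.

```r
library(dplyr)
library(ggplot2)

# two hypothetical datasets with the same columns
site_a <- data.frame(week = 1:4, cases = c(3, 8, 12, 6))
site_b <- data.frame(week = 1:4, cases = c(1, 4, 9, 15))

# stack the rows; the list names are stored in the new "dataset" column
combined_long <- bind_rows(
  list(site_a = site_a, site_b = site_b),
  .id = "dataset")

# one plot, one shared set of axes, with the dataset mapped to color
multi_plot <- ggplot(
    data = combined_long,
    mapping = aes(x = week, y = cases, color = dataset))+
  geom_line()+
  theme_minimal()

multi_plot
```

Because both datasets share a single data frame, ggplot2 computes one common axis range and the alignment problem disappears.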

31.10 Combine plots

Two packages that are very useful for combining plots are cowplot and patchwork. In this page we will mostly focus on cowplot, with occasional use of patchwork.

Here is the online introduction to cowplot. You can read the more extensive documentation for each function online here. We will cover a few of the most common use cases and functions below.

The cowplot package works in tandem with ggplot2 - essentially, you use it to arrange and combine ggplots and their legends into compound figures. It can also accept base R graphics.

pacman::p_load(
  tidyverse,      # data manipulation and visualisation
  cowplot,        # combine plots
  patchwork       # combine plots
)

While faceting (described in the ggplot basics page) is a convenient approach to plotting, sometimes it’s not possible to get the results you want from its relatively restrictive approach. Here, you may choose to combine plots by sticking them together into a larger plot. There are three well-known packages that are great for this - cowplot, gridExtra, and patchwork. However, these packages largely do the same things, so we’ll focus on cowplot for this section.

plot_grid()

The cowplot package has a fairly wide range of functions, but the easiest use of it can be achieved through the use of plot_grid(). This is effectively a way to arrange predefined plots in a grid formation. We can work through another example with the malaria dataset - here we can plot the total cases by district, and also show the epidemic curve over time.

malaria_data <- rio::import(here::here("data", "malaria_facility_count_data.rds")) 

# bar chart of total cases by district
p1 <- ggplot(malaria_data, aes(x = District, y = malaria_tot)) +
  geom_bar(stat = "identity") +
  labs(
    x = "District",
    y = "Total number of cases",
    title = "Total malaria cases by district"
  ) +
  theme_minimal()

# epidemic curve over time
p2 <- ggplot(malaria_data, aes(x = data_date, y = malaria_tot)) +
  geom_col(width = 1) +
  labs(
    x = "Date of data submission",
    y =  "number of cases"
  ) +
  theme_minimal()

cowplot::plot_grid(p1, p2,
                  # 1 column and two rows - stacked on top of each other
                   ncol = 1,
                   nrow = 2,
                   # top plot is 2/3 as tall as second
                   rel_heights = c(2, 3))
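
For comparison, a similar stacked layout could be written with patchwork, which combines plots using operators. This is only a sketch on made-up data (the data frame toy and the plot names are ours), assuming the patchwork package is installed.

```r
library(ggplot2)
library(patchwork)

toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))   # made-up data

top_plot    <- ggplot(toy, aes(x = x, y = y))+ geom_col()
bottom_plot <- ggplot(toy, aes(x = x, y = y))+ geom_line()

# "/" stacks plots vertically; plot_layout() sets the relative heights
stacked <- top_plot / bottom_plot + plot_layout(heights = c(2, 3))

stacked
```

Which package to use is largely a matter of taste; cowplot gives explicit control through function arguments, while patchwork favours a compact operator syntax.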

Combine legends

If your plots have the same legend, combining them is relatively straightforward. Simply use the cowplot approach above to combine the plots, but remove the legend from one of them (de-duplicate).

If your plots have different legends, you must use an alternative approach:

  1. Create and save your plots without legends using theme(legend.position = "none")
  2. Extract the legends from each plot using get_legend() as shown below - but extract legends from the plots modified to actually show the legend
  3. Combine the legends into a legends panel
  4. Combine the plots and legends panel

For demonstration we show the two plots separately, and then arranged in a grid with their own legends showing (ugly and inefficient use of space):

p1 <- linelist %>% 
  mutate(hospital = recode(hospital, "St. Mark's Maternity Hospital (SMMH)" = "St. Marks")) %>% 
  count(hospital, outcome) %>% 
  ggplot()+
  geom_col(mapping = aes(x = hospital, y = n, fill = outcome))+
  scale_fill_brewer(type = "qual", palette = 4, na.value = "grey")+
  coord_flip()+
  theme_minimal()+
  labs(title = "Cases by outcome")


p2 <- linelist %>% 
  mutate(hospital = recode(hospital, "St. Mark's Maternity Hospital (SMMH)" = "St. Marks")) %>% 
  count(hospital, age_cat) %>% 
  ggplot()+
  geom_col(mapping = aes(x = hospital, y = n, fill = age_cat))+
  scale_fill_brewer(type = "qual", palette = 1, na.value = "grey")+
  coord_flip()+
  theme_minimal()+
  theme(axis.text.y = element_blank())+
  labs(title = "Cases by age")

Here is how the two plots look when combined using plot_grid() without combining their legends:

cowplot::plot_grid(p1, p2, rel_widths = c(0.3))

And now we show how to combine the legends. Essentially, we define each plot without its legend (theme(legend.position = "none")), and then we define each plot’s legend separately, using the get_legend() function from cowplot. When we extract the legend from the saved plot, we need to add (+) the legend back in, including specifying the placement (“right”) and smaller adjustments for alignment of the legends and their titles. Then, we combine the legends together vertically, and then combine the two plots with the newly-combined legends. Voila!

# Define plot 1 without legend
p1 <- linelist %>% 
  mutate(hospital = recode(hospital, "St. Mark's Maternity Hospital (SMMH)" = "St. Marks")) %>% 
  count(hospital, outcome) %>% 
  ggplot()+
  geom_col(mapping = aes(x = hospital, y = n, fill = outcome))+
  scale_fill_brewer(type = "qual", palette = 4, na.value = "grey")+
  coord_flip()+
  theme_minimal()+
  theme(legend.position = "none")+
  labs(title = "Cases by outcome")


# Define plot 2 without legend
p2 <- linelist %>% 
  mutate(hospital = recode(hospital, "St. Mark's Maternity Hospital (SMMH)" = "St. Marks")) %>% 
  count(hospital, age_cat) %>% 
  ggplot()+
  geom_col(mapping = aes(x = hospital, y = n, fill = age_cat))+
  scale_fill_brewer(type = "qual", palette = 1, na.value = "grey")+
  coord_flip()+
  theme_minimal()+
  theme(
    legend.position = "none",
    axis.text.y = element_blank(),
    axis.title.y = element_blank()
  )+
  labs(title = "Cases by age")


# extract legend from p1 (from p1 + legend)
leg_p1 <- cowplot::get_legend(p1 +
                                theme(legend.position = "right",        # extract vertical legend
                                      legend.justification = c(0,0.5))+ # so legends  align
                                labs(fill = "Outcome"))                 # title of legend
# extract legend from p2 (from p2 + legend)
leg_p2 <- cowplot::get_legend(p2 + 
                                theme(legend.position = "right",         # extract vertical legend   
                                      legend.justification = c(0,0.5))+  # so legends align
                                labs(fill = "Age Category"))             # title of legend

# create a blank plot for legend alignment
#blank_p <- patchwork::plot_spacer() + theme_void()

# create legends panel, can be one on top of the other (or use spacer commented above)
legends <- cowplot::plot_grid(leg_p1, leg_p2, nrow = 2, rel_heights = c(.3, .7))

# combine two plots and the combined legends panel
combined <- cowplot::plot_grid(p1, p2, legends, ncol = 3, rel_widths = c(.4, .4, .2))

combined  # print

This solution was learned from this post with a minor fix to align legends from this post.

TIP: Fun note - the “cow” in cowplot comes from the creator’s name - Claus O. Wilke.

Inset plots

You can inset one plot in another using cowplot. Here are things to be aware of:

  • Define the main plot with theme_half_open() from cowplot; it may be best to have the legend either on top or bottom
  • Define the inset plot. Best is to have a plot where you do not need a legend. You can remove plot theme elements with element_blank() as shown below.
  • Combine them by applying ggdraw() to the main plot, then adding draw_plot() on the inset plot and specifying the coordinates (x and y of lower left corner), height and width as proportion of the whole main plot.

# Define main plot
main_plot <- ggplot(data = linelist)+
  geom_histogram(aes(x = date_onset, fill = hospital))+
  scale_fill_brewer(type = "qual", palette = 1, na.value = "grey")+ 
  theme_half_open()+
  theme(legend.position = "bottom")+
  labs(title = "Epidemic curve and outcomes by hospital")


# Define inset plot
inset_plot <- linelist %>% 
  mutate(hospital = recode(hospital, "St. Mark's Maternity Hospital (SMMH)" = "St. Marks")) %>% 
  count(hospital, outcome) %>% 
  ggplot()+
    geom_col(mapping = aes(x = hospital, y = n, fill = outcome))+
    scale_fill_brewer(type = "qual", palette = 4, na.value = "grey")+
    coord_flip()+
    theme_minimal()+
    theme(legend.position = "none",
          axis.title.y = element_blank())+
    labs(title = "Cases by outcome") 


# Combine main with inset
cowplot::ggdraw(main_plot)+
     draw_plot(inset_plot,
               x = .6, y = .55,    #x = .07, y = .65,
               width = .4, height = .4)

This technique is explained more in these two vignettes:

Wilke lab
draw_plot() documentation

31.11 Dual axes

A secondary y-axis is often a requested addition to a ggplot2 graph. While there is a robust debate about the validity of such graphs in the data visualization community, and they are often not recommended, your manager may still want them. Below, we present one method to achieve them: using the cowplot package to combine two separate plots.

This approach involves creating two separate plots - one with a y-axis on the left, and the other with y-axis on the right. Both will use a specific theme_cowplot() and must have the same x-axis. Then in a third command the two plots are aligned and overlaid on top of each other. The functionalities of cowplot, of which this is only one, are described in depth at this site.

To demonstrate this technique we will overlay the epidemic curve with a line of the weekly percent of patients who died. We use this example because the alignment of dates on the x-axis is more complex than say, aligning a bar chart with another plot. Some things to note:

  • The epicurve and the line are aggregated into weeks prior to plotting and the date_breaks and date_labels are identical - we do this so that the x-axes of the two plots are the same when they are overlaid.
  • The y-axis is moved to the right-side for plot 2 with the position = argument of scale_y_continuous().
  • Both plots make use of theme_cowplot()

Note there is another example of this technique in the Epidemic curves page - overlaying cumulative incidence on top of the epicurve.

Make plot 1
This is essentially the epicurve. We use geom_area() just to demonstrate its use (area under a line, by default)

pacman::p_load(cowplot)            # load/install cowplot

p1 <- linelist %>%                 # save plot as object
     count(
       epiweek = lubridate::floor_date(date_onset, "week")) %>% 
     ggplot()+
          geom_area(aes(x = epiweek, y = n), fill = "grey")+
          scale_x_date(
               date_breaks = "month",
               date_labels = "%b")+
     theme_cowplot()+
     labs(
       y = "Weekly cases"
     )

p1                                      # view plot 

Make plot 2
Create the second plot showing a line of the weekly percent of cases who died.

p2 <- linelist %>%         # save plot as object
     group_by(
       epiweek = lubridate::floor_date(date_onset, "week")) %>% 
     summarise(
       n = n(),
       pct_death = 100*sum(outcome == "Death", na.rm=T) / n) %>% 
     ggplot(aes(x = epiweek, y = pct_death))+
          geom_line()+
          scale_x_date(
               date_breaks = "month",
               date_labels = "%b")+
          scale_y_continuous(
               position = "right")+
          theme_cowplot()+
          labs(
            x = "Epiweek of symptom onset",
            y = "Weekly percent of deaths",
            title = "Weekly case incidence and percent deaths"
          )

p2     # view plot

Now we align the plots using the function align_plots(), specifying horizontal and vertical alignment (“hv”, could also be “h”, “v”, “none”). We specify alignment of all axes as well (top, bottom, left, and right) with “tblr”. The output is of class list (2 elements).

Then we draw the two plots together using ggdraw() (from cowplot) and referencing the two parts of the aligned_plots object.

aligned_plots <- cowplot::align_plots(p1, p2, align="hv", axis="tblr")         # align the two plots and save them as list
aligned_plotted <- ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])  # overlay them and save the visual plot
aligned_plotted                                                                # print the overlaid plots

31.12 Packages to help you

There are some really neat R packages specifically designed to help you navigate ggplot2:

Point-and-click ggplot2 with esquisse

“This addin allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar plots, curves, scatter plots, histograms, boxplot and sf objects, then export the graph or retrieve the code to reproduce the graph.”

Install and then launch the addin via the RStudio menu or with esquisse::esquisser().

See the Github page

Documentation

31.13 Miscellaneous

Numeric display

You can disable scientific notation by running this command prior to plotting.

options(scipen=999)

Or use functions from the scales package, such as number() and comma(), to adjust how specific numbers are displayed. These can be applied to columns in your data frame, but are shown on individual numbers below for purpose of example.

scales::number(6.2e5)
## [1] "620 000"
scales::number(1506800.62, accuracy = 0.1)
## [1] "1 506 800.6"
scales::comma(1506800.62, accuracy = 0.01)
## [1] "1,506,800.62"
scales::comma(1506800.62, accuracy = 0.01,  big.mark = "." , decimal.mark = ",")
## [1] "1.506.800,62"
scales::percent(0.1)
## [1] "10%"
scales::dollar(56)
## [1] "$56"
scales::scientific(100000)
## [1] "1e+05"
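As a hedged illustration of applying these functions to whole columns rather than single values, the sketch below formats a small made-up data frame. The object district_summary and its column names are hypothetical and not part of the handbook data.

```r
library(scales)

# hypothetical summary table (values invented for illustration)
district_summary <- data.frame(
  district = c("North", "South"),
  cases    = c(1506800, 620000),
  cfr      = c(0.1234, 0.05)
)

# add display-ready character columns; the numeric columns are kept for calculations
district_summary$cases_label <- comma(district_summary$cases)
district_summary$cfr_label   <- percent(district_summary$cfr, accuracy = 0.1)

district_summary
```

Keeping the original numeric columns alongside the formatted ones is a design choice: the formatted columns are class character and can no longer be summed or sorted numerically.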

31.14 Resources

Inspiration ggplot graph gallery

Presentation of data European Centre for Disease Prevention and Control Guidelines of presentation of surveillance data

Facets and labellers Using labellers for facet strips Labellers

Adjusting order with factors fct_reorder
fct_inorder
How to reorder a boxplot
Reorder a variable in ggplot2
R for Data Science - Factors

Legends
Adjust legend order

Captions Caption alignment

Labels
ggrepel

Cheatsheets
Beautiful plotting with ggplot2

32 Epidemic curves

An epidemic curve (also known as an “epi curve”) is a core epidemiological chart typically used to visualize the temporal pattern of illness onset among a cluster or epidemic of cases.

Analysis of the epicurve can reveal temporal trends, outliers, the magnitude of the outbreak, the most likely time period of exposure, time intervals between case generations, and can even help identify the mode of transmission of an unidentified disease (e.g. point source, continuous common source, person-to-person propagation). One online lesson on interpretation of epi curves can be found at the website of the US CDC.

In this page we demonstrate two approaches to producing epicurves in R:

  • The incidence2 package, which can produce an epi curve with simple commands
  • The ggplot2 package, which allows for advanced customizability via more complex commands

Also addressed are specific use-cases such as:

  • Plotting aggregated count data
  • Faceting or producing small-multiples
  • Applying moving averages
  • Showing which data are “tentative” or subject to reporting delays
  • Overlaying cumulative case incidence using a second axis

32.1 Preparation

Packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,          # file import/export
  here,         # relative filepaths 
  lubridate,    # working with dates/epiweeks
  aweek,        # alternative package for working with dates/epiweeks
  incidence2,   # epicurves of linelist data
  i2extras,     # supplement to incidence2
  stringr,      # search and manipulate character strings
  forcats,      # working with factors
  RColorBrewer, # Color palettes from colorbrewer2.org
  tidyverse     # data management + ggplot2 graphics
) 

Import data

Two example datasets are used in this section:

  • Linelist of individual cases from a simulated epidemic
  • Aggregated counts by hospital from the same simulated epidemic

The datasets are imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

Case linelist

We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow step-by-step, see the instructions in the Download handbook and data page. We assume the file is in the working directory so no sub-folders are specified in this file path.

linelist <- import("linelist_cleaned.xlsx")

The first 50 rows are displayed below.

Case counts aggregated by hospital

For the purposes of the handbook, the dataset of daily aggregated counts by hospital is created from the linelist with the following code.

# create the counts data from the linelist
count_data <- linelist %>% 
  group_by(hospital, date_hospitalisation) %>% 
  summarize(n_cases = dplyr::n()) %>% 
  filter(date_hospitalisation > as.Date("2013-06-01")) %>% 
  ungroup()

The first 50 rows are displayed below:

Set parameters

For production of a report, you may want to set editable parameters such as the date for which the data is current (the “data date”). You can then reference the object data_date in your code when applying filters or in dynamic captions.

## set the data date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2015-05-15")
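A minimal sketch of how data_date might then be referenced, assuming a small invented linelist (mini_linelist and its values are hypothetical). It filters to cases with onset on or before the data date and builds a caption that updates whenever data_date is changed.

```r
# hypothetical mini-linelist (dates invented for illustration)
mini_linelist <- data.frame(
  case_id    = 1:4,
  date_onset = as.Date(c("2015-04-20", "2015-05-10", "2015-05-15", "2015-05-20"))
)

data_date <- as.Date("2015-05-15")

# keep only cases with onset on or before the data date
current_cases <- mini_linelist[mini_linelist$date_onset <= data_date, ]

# build a caption that updates automatically when data_date changes
plot_caption <- paste0(
  "n = ", nrow(current_cases),
  " cases; data current as of ", format(data_date, "%d/%m/%Y")
)

plot_caption   # "n = 3 cases; data current as of 15/05/2015"
```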

Verify dates

Verify that each relevant date column is class Date and has an appropriate range of values. You can do this simply using hist() for histograms, or range() with na.rm=TRUE, or with ggplot() as below.

# check range of onset dates
ggplot(data = linelist)+
  geom_histogram(aes(x = date_onset))
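The non-graphical checks mentioned above can be sketched as follows, using a few invented date values (onset_raw is a hypothetical vector standing in for a date column that was imported as text).

```r
# hypothetical dates imported as text, including one missing value
onset_raw <- c("2014-04-07", "2014-05-01", NA, "2015-04-30")

class(onset_raw)               # "character" - not yet usable as dates

# convert to class Date (the ISO format YYYY-MM-DD is parsed by default)
onset <- as.Date(onset_raw)

class(onset)                   # "Date"
range(onset, na.rm = TRUE)     # earliest and latest non-missing dates
sum(is.na(onset))              # number of missing dates
```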

32.2 Epicurves with incidence2 package

Below we demonstrate how to make epicurves using the incidence2 package. The authors of this package have tried to allow the user to create and modify epicurves without needing to know ggplot2 syntax. Much of this page is adapted from the package vignettes, which can be found at the incidence2 github page.

Simple example

Two steps are required to plot an epidemic curve with the incidence2 package:

  1. Create an incidence object (using the function incidence())
    • Provide the data
    • Specify the date column to date_index =
    • Specify the interval = into which the cases should be aggregated (daily, weekly, monthly..)
    • Specify any grouping columns (e.g. gender, hospital, outcome)
  2. Plot the incidence object
    • Specify labels, colors, titles, etc.

Below, we load the incidence2 package and create the incidence object from the linelist, using the column date_onset and aggregating cases by day. We then print the object and a summary of it.

# load incidence2 package
pacman::p_load(incidence2)

# create the incidence object, aggregating cases by day
epi_day <- incidence(       # create incidence object
  x = linelist,             # dataset
  date_index = date_onset,  # date column
  interval = "day"          # date grouping interval
  )

The incidence2 object itself looks like a tibble (like a data frame) and can be printed or further manipulated like a data frame.

class(epi_day)
## [1] "incidence2" "tbl_df"     "tbl"        "data.frame"

Here is what it looks like when printed. It has a date_index column and a count column.

epi_day
## An incidence2 object: 367 x 2
## 5632 cases from 2014-04-07 to 2015-04-30
## interval: 1 day
## cumulative: FALSE
## 
##    date_index count
##    <date>     <int>
##  1 2014-04-07     1
##  2 2014-04-15     1
##  3 2014-04-21     2
##  4 2014-04-25     1
##  5 2014-04-26     1
##  6 2014-04-27     1
##  7 2014-05-01     2
##  8 2014-05-03     1
##  9 2014-05-04     1
## 10 2014-05-05     1
## # ... with 357 more rows

You can also print a summary of the object:

# print summary of the incidence object
summary(epi_day)
## An incidence2 object: 367 x 2
## 5632 cases from 2014-04-07 to 2015-04-30
## interval: 1 day
## cumulative: FALSE
## timespan: 389 days

To plot the incidence object, use plot() on the name of the incidence object. In the background, the function plot.incidence2() is called, so to read the incidence2-specific documentation you would run ?plot.incidence2.

# plot the incidence object
plot(epi_day)

If you notice lots of tiny white vertical lines, try to adjust the size of your image. For example, if you export your plot with ggsave(), you can provide numbers to width = and height =. If you widen the plot those lines may disappear.

Change time interval of case aggregation

The interval = argument of incidence() defines how the observations are grouped into vertical bars.

Specify interval

incidence2 provides flexibility and understandable syntax for specifying how you want to aggregate your cases into epicurve bars. Provide a value like the ones below to the interval = argument. You can write any of the below as plural (e.g. “weeks”), and you can add numbers before (e.g. “3 months”).

Argument option | Further explanation
Number (1, 7, 13, 14, etc.) | Number of days per interval
“week” | Note: Monday start day is default
“2 weeks” | or 3, 4, 5…
“Sunday week” | Weeks beginning on Sundays (could also use Thursday, etc.)
“2 Sunday weeks” | or 3, 4, 5…
“MMWRweek” | Week starts on Sundays - see US CDC
“month” | 1st of month
“quarter” | 1st of month of quarter
“2 months” | or 3, 4, 5…
“year” | 1st day of calendar year

Below are examples of how different intervals look when applied to the linelist. Note how the default format and frequency of the date labels on the x-axis change as the date interval changes.

# Create the incidence objects (with different intervals)
##############################
# Weekly (Monday week by default)
epi_wk      <- incidence(linelist, date_onset, interval = "Monday week")

# Sunday week
epi_Sun_wk  <- incidence(linelist, date_onset, interval = "Sunday week")

# Two weeks (Monday weeks by default)
epi_2wk     <- incidence(linelist, date_onset, interval = "2 weeks")

# Monthly
epi_month   <- incidence(linelist, date_onset, interval = "month")

# Quarterly
epi_quarter   <- incidence(linelist, date_onset, interval = "quarter")

# Years
epi_year   <- incidence(linelist, date_onset, interval = "year")


# Plot the incidence objects (+ titles for clarity)
############################
plot(epi_wk)+      labs(title = "Monday weeks")
plot(epi_Sun_wk)+  labs(title = "Sunday weeks")
plot(epi_2wk)+     labs(title = "2 (Monday) weeks")
plot(epi_month)+   labs(title = "Months")
plot(epi_quarter)+ labs(title = "Quarters")
plot(epi_year)+    labs(title = "Years")
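To see why the choice of week start matters, here is a small sketch using floor_date() from lubridate to find the week a single onset date falls into. This only illustrates the concept of Monday versus Sunday weeks; it is not a description of how incidence2 computes its intervals internally.

```r
library(lubridate)

onset <- as.Date("2014-04-09")   # a Wednesday

# first day of the week containing the onset date
monday_wk <- floor_date(onset, unit = "week", week_start = 1)  # weeks beginning Monday
sunday_wk <- floor_date(onset, unit = "week", week_start = 7)  # weeks beginning Sunday

monday_wk   # 2014-04-07
sunday_wk   # 2014-04-06
```

The same case is therefore assigned to a bar starting one day earlier when Sunday weeks are used, which can shift counts between adjacent bars.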

First date

You can optionally specify a value of class Date (e.g. as.Date("2016-05-01")) to firstdate = in the incidence() command. If given, the data will be trimmed to this range and the intervals will begin on this date.

Groups

Groups are specified in the incidence() command, and can be used to color the bars or to facet the data. To specify groups in your data provide the column name(s) to the groups = argument in the incidence() command (no quotes around the column name). If specifying multiple columns, put their names within c().

You can specify that cases with missing values in the grouping columns be listed as a distinct NA group by setting na_as_group = TRUE. Otherwise, they will be excluded from the plot.

  • To color the bars by a grouping column, you must again provide the column name to fill = in the plot() command.

  • To facet based on a grouping column, see the section below on facets with incidence2.

In the example below, the cases in the whole outbreak are grouped by their age category. Missing values are included as a group. The epicurve interval is weeks.

# Create incidence object, with data grouped by age category
age_outbreak <- incidence(
  linelist,                # dataset
  date_index = date_onset, # date column
  interval = "week",       # Monday weekly aggregation of cases
  groups = age_cat,        # age_cat is set as a group
  na_as_group = TRUE)      # missing values assigned their own group

# plot the grouped incidence object
plot(
  age_outbreak,             # incidence object with age_cat as group
  fill = age_cat)+          # age_cat is used for bar fill color (must have been set as a groups column above)
labs(fill = "Age Category") # change legend title from default "age_cat" (this is a ggplot2 modification)

TIP: Change the title of the legend by adding + the ggplot2 command labs(fill = "your title") to your incidence2 plot.

You can also have the grouped bars display side-by-side by setting stack = FALSE in plot(), as shown below:

# Make incidence object of monthly counts. 
monthly_gender <- incidence(
 linelist,
 date_index = date_onset,
 interval = "month",
 groups = gender            # set gender as grouping column
)

plot(
  monthly_gender,   # incidence object
  fill = gender,    # display bars colored by gender
  stack = FALSE)    # side-by-side (not stacked)

You can set the na_as_group = argument to FALSE in the incidence() command to remove rows with missing values from the plot.

Filtered data

To plot the epicurve of a subset of data:

  1. Filter the linelist data
  2. Provide the filtered data to the incidence() command
  3. Plot the incidence object

The example below uses data filtered to show only cases at Central Hospital.

# filter the linelist
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using filtered data
central_outbreak <- incidence(central_data, date_index = date_onset, interval = "week")

# plot the incidence object
plot(central_outbreak, title = "Weekly case incidence at Central Hospital")

Aggregated counts

If your original data are aggregated (counts), provide the name of the column that contains the case counts to the count = argument when creating the incidence object with incidence().

For example, this data frame count_data is the linelist aggregated into daily counts by hospital. The first 50 rows look like this:

If you are beginning your analysis with daily count data like the dataset above, your incidence() command to convert this to a weekly epicurve by hospital would look like this:

epi_counts <- incidence(              # create weekly incidence object
  count_data,                         # dataset with counts aggregated by day
  date_index = date_hospitalisation,  # column with dates
  count = n_cases,                    # column with counts
  interval = "week",                  # aggregate daily counts up to weeks
  groups = hospital                   # group by hospital
  )

# plot the weekly incidence epi curve, with stacked bars by hospital
plot(epi_counts,                      # incidence object
     fill = hospital)                 # color the bars by hospital
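For intuition about what aggregating daily counts up to weeks involves, the sketch below does the equivalent by hand with dplyr and lubridate on a few invented rows (daily_counts and the hospital names "A" and "B" are hypothetical). It is an illustration under those assumptions, not the internal method of incidence().

```r
library(dplyr)
library(lubridate)

# hypothetical daily counts (invented for illustration)
daily_counts <- tibble(
  hospital             = c("A", "A", "A", "B"),
  date_hospitalisation = as.Date(c("2014-04-07", "2014-04-09", "2014-04-14", "2014-04-08")),
  n_cases              = c(2, 1, 4, 3)
)

weekly_counts <- daily_counts %>%
  # assign each day to the Monday that begins its week
  mutate(week = floor_date(date_hospitalisation, unit = "week", week_start = 1)) %>%
  # sum the daily counts within each hospital-week
  group_by(hospital, week) %>%
  summarise(n_cases = sum(n_cases), .groups = "drop")

weekly_counts
```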

Facets/small multiples

To facet the data by group (i.e. produce “small multiples”):

  1. Specify the faceting column to groups = when you create the incidence object
  2. Use the facet_plot() command instead of plot()
  3. Specify which grouping columns to use as fill = and which to use as facets =

Below, we set both columns hospital and outcome as grouping columns in the incidence() command. Then, in facet_plot() we plot the epicurve, specifying that we want a different epicurve for each hospital and that within each epicurve the bars should be stacked and colored by outcome.

epi_wks_hosp_out <- incidence(
  linelist,                      # dataset
  date_index = date_onset,       # date column
  interval = "month",            # monthly bars  
  groups = c(outcome, hospital)  # both outcome and hospital are given as grouping columns
  )

# plot
incidence2::facet_plot(
  epi_wks_hosp_out,      # incidence object
  facets = hospital,     # facet column
  fill = outcome)        # fill column

Note that the package ggtree (used for displaying phylogenetic trees) also has a function facet_plot() - this is why we specified incidence2::facet_plot() above.

Modifications with plot()

An epicurve produced by incidence2 can be modified via these arguments within the plot() function.

Here are plot() arguments that modify the appearance of the bars:

Argument | Description | Examples
fill = | Bar color. Either a color name or a column name previously specified to groups = in the incidence() command | fill = "red", or fill = gender
color = | Color around each bar, or around each grouping within a bar | color = "white"
legend = | Location of legend | One of “bottom”, “top”, “left”, “right”, or “none”
alpha = | Transparency of bars/boxes | 1 is fully opaque, 0 is fully transparent
width = | Value between 0 and 1 indicating the relative size of the bars to their time interval | width = .7
show_cases = | Logical; if TRUE, each case shows as a box. Displays best on smaller outbreaks. | show_cases = TRUE

Here are plot() arguments that modify the date axis:

Argument(s) | Description
centre_dates = | TRUE/FALSE as to whether date displays appear under center of bars, or at beginning of bars
date_format = | Adjust the date display format using strptime (“%”) syntax. Only works if centre_dates = FALSE (details below).
n.breaks = | Approximate number of x-axis label breaks desired.
angle = | Angle of x-axis date labels (number of degrees)
size = | Size of text in points

Note that the date_format = argument only works if centre_dates = FALSE. Provide a character value in quotation marks using the strptime syntax below, as detailed in the Working with dates page. You can use \n for a “newline”.

%d = Day number of month (5, 17, 28, etc.)
%j = Day number of the year (Julian day 001-366)
%a = Abbreviated weekday (Mon, Tue, Wed, etc.)
%A = Full weekday (Monday, Tuesday, etc.)
%w = Weekday number (0-6, Sunday is 0)
%u = Weekday number (1-7, Monday is 1)
%W = Week number (00-53, Monday is week start)
%U = Week number (00-53, Sunday is week start)
%m = Month number (e.g. 01, 02, 03, 04)
%b = Abbreviated month (Jan, Feb, etc.)
%B = Full month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = Time zone (character)
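You can try these codes outside of a plot with format() from base R. The sketch below uses an arbitrary example date and time (admit_time is a hypothetical object); outputs that contain day or month names are not shown because they depend on your computer's locale.

```r
onset <- as.Date("2014-04-07")

format(onset, "%d/%m/%Y")   # "07/04/2014"
format(onset, "%j")         # "097" (day number of the year)
format(onset, "%u")         # "1"   (Monday)
format(onset, "%W")         # "14"  (week number, Monday is week start)

# day and month names depend on the locale; \n starts a new line
format(onset, "%a %d %b %Y\n(Week %W)")

# time components use upper-case codes
admit_time <- as.POSIXct("2014-04-07 14:05:09", tz = "UTC")
format(admit_time, "%H:%M:%S")   # "14:05:09"
```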

Here are plot() arguments that modify plot labels:

Argument(s) | Description
title = | Title of plot
xlab = | Title of x-axis
ylab = | Title of y-axis
size = | Size of x-axis text in pts (use ggplot’s theme() to adjust other sizes)

An example using many of the above arguments:

# filter the linelist
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using filtered data
central_outbreak <- incidence(
  central_data,
  date_index = date_onset,
  interval = "week",
  groups = outcome)

# plot incidence object
plot(
  central_outbreak,
  fill = outcome,                       # box/bar color
  legend = "top",                       # legend on top
  title = "Cases at Central Hospital",  # title
  xlab = "Week of onset",               # x-axis label
  ylab = "Weekly cases",                # y-axis label
  show_cases = TRUE,                    # show each case as an individual box
  alpha = 0.7,                          # transparency 
  border = "grey",                      # box border
  angle = 30,                           # angle of date labels
  centre_dates = FALSE,                 # date labels at edge of bar
  date_format = "%a %d %b %Y\n(Week %W)" # adjust how dates are displayed
  )

To further adjust plot appearance, see the section below on modifications with ggplot().

Modifications with ggplot2

You can further modify an incidence2 plot by adding ggplot2 modifications with a + after the close of the incidence plot() function, as demonstrated below.

Below, the incidence2 plot finishes and then ggplot2 commands are used to modify the axes, add a caption, and adjust the bold font and text size.

Note that if you add scale_x_date(), most date formatting from plot() will be overwritten. See the ggplot() epicurves section and the Handbook page ggplot tips for more options.

# filter the linelist
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using filtered data
central_outbreak <- incidence(
  central_data,
  date_index = date_onset,
  interval = "week",
  groups = c(outcome))

# plot incidence object
plot(
  central_outbreak,
  fill = outcome,                       # box/bar color
  legend = "top",                       # legend on top
  title = "Cases at Central Hospital",  # title
  xlab = "Week of onset",               # x-axis label
  ylab = "Weekly cases",                # y-axis label
  show_cases = TRUE,                    # show each case as an individual box
  alpha = 0.7,                          # transparency 
  border = "grey",                      # box border
  centre_dates = FALSE,                   
  date_format = "%a %d %b\n%Y (Week %W)", 
  angle = 30                           # angle of date labels
  )+
  
  scale_y_continuous(
    breaks = seq(from = 0, to = 30, by = 5),  # specify y-axis increments by 5
    expand = c(0,0))+                         # remove excess space below 0 on y-axis
  
  # add dynamic caption
  labs(
    fill = "Patient outcome",                               # Legend title
    caption = stringr::str_glue(                            # dynamic caption - see page on characters and strings for details
      "n = {central_cases} from Central Hospital
      Case onsets range from {earliest_date} to {latest_date}. {missing_onset} cases are missing date of onset and not shown",
      central_cases = nrow(central_data),
      earliest_date = format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y'),
      latest_date = format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y'),      
      missing_onset = nrow(central_data %>% filter(is.na(date_onset)))))+
  
  # adjust bold face, and caption position
  theme(
    axis.title = element_text(size = 12, face = "bold"),    # axis titles larger and bold
    axis.text = element_text(size = 10, face = "bold"),     # axis text size and bold
    plot.caption = element_text(hjust = 0, face = "italic") # move caption to left
  )
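The dynamic caption can be built and inspected on its own before it is added to a plot. This is a minimal sketch with invented onset dates (the vector onset and the placeholder names inside the braces are hypothetical), using str_glue() from stringr as in the code above.

```r
library(stringr)

# hypothetical onset dates, one of them missing
onset <- as.Date(c("2014-05-03", NA, "2014-06-21", "2014-05-30"))

caption_text <- str_glue(
  "n = {n_total} cases. Onsets range from {first_onset} to {last_onset}. ",
  "{n_missing} missing date of onset.",
  n_total     = length(onset),                                  # all cases
  first_onset = format(min(onset, na.rm = TRUE), "%d/%m/%Y"),   # earliest onset
  last_onset  = format(max(onset, na.rm = TRUE), "%d/%m/%Y"),   # latest onset
  n_missing   = sum(is.na(onset))                               # cases without onset date
)

caption_text
```

Printing the caption first makes it easier to catch a wrong count or date format than reading it off the finished plot.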

Change colors

Specify a palette

Provide the name of a pre-defined palette to the col_pal = argument in plot(). The incidence2 package comes with 2 pre-defined palettes: “vibrant” and “muted”. In “vibrant” the first 6 colors are distinct and in “muted” the first 9 colors are distinct. After these numbers, the colors are interpolations/intermediaries of other colors. These pre-defined palettes can be found at this website. The palettes exclude grey, which is reserved for missing data (use na_color = to change this default).

# Create incidence object, with data grouped by age category  
age_outbreak <- incidence(
  linelist,
  date_index = date_onset,   # date of onset for x-axis
  interval = "week",         # weekly aggregation of cases
  groups = age_cat)

# plot the epicurve with default palette
plot(age_outbreak, fill = age_cat, title = "'vibrant' default incidence2 palette")

# plot with different color palette
#plot(age_outbreak, fill = age_cat, col_pal = muted, title = "'muted' incidence2 palette")

You can also use one of the base R palettes (put the name of the palette without quotes).

# plot with base R palette
plot(age_outbreak, fill = age_cat, col_pal = heat.colors, title = "base R heat.colors palette")

# plot with base R palette
plot(age_outbreak, fill = age_cat, col_pal = rainbow, title = "base R rainbow palette")

You can also add a color palette from the viridis package or RColorBrewer package. First those packages must be loaded, then add their respective scale_fill_*() functions with a +, as shown below.

pacman::p_load(RColorBrewer, viridis)

# plot with color palette
plot(age_outbreak, fill = age_cat, title = "Viridis palette")+
  scale_fill_viridis_d(
    option = "inferno",     # color scheme, try also "plasma" or the default
    name = "Age Category",  # legend name
    na.value = "grey")      # for missing values

# plot with color palette
plot(age_outbreak, fill = age_cat, title = "RColorBrewer palette")+
  scale_fill_brewer(
    palette = "Dark2",      # color palette, try also Accent, Paired, Pastel1, Pastel2, Set1, Set2, Set3
    name = "Age Category",  # legend name
    na.value = "grey")      # for missing values

Specify manually

To specify colors manually, add the ggplot2 function scale_fill_manual() to the plot() with a + and provide the vector of color names or HEX codes to the argument values =. The number of colors listed must equal the number of groups. Be aware of whether missing values are a group - they can be converted to a character value like “Missing” during your data preparation with the function fct_explicit_na() as explained in the page on Factors.

# manual colors
plot(age_outbreak, fill = age_cat, title = "Manually-specified colors")+
  scale_fill_manual(
    values = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange", "red", "lightblue"),  # colors
    name = "Age Category")      # Name for legend
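One way to avoid depending on the order of the colors is to give scale_fill_manual() a named vector, so that each color is tied to a specific group value. The sketch below shows this on a plain ggplot2 bar chart with invented counts (toy_counts and outcome_colors are hypothetical objects); we assume the same scale can be added to an incidence2 plot with a +, as in the example above.

```r
library(ggplot2)

# hypothetical weekly counts by outcome (invented for illustration)
toy_counts <- data.frame(
  week    = rep(as.Date(c("2014-04-07", "2014-04-14")), each = 3),
  outcome = rep(c("Death", "Recover", "Missing"), times = 2),
  n       = c(2, 5, 1, 3, 4, 2)
)

# names are the group values, elements are the colors
outcome_colors <- c("Death" = "darkred", "Recover" = "darkgreen", "Missing" = "grey")

toy_plot <- ggplot(toy_counts, aes(x = week, y = n, fill = outcome)) +
  geom_col() +                                  # stacked bars
  scale_fill_manual(
    values = outcome_colors,                    # colors matched to groups by name
    name   = "Outcome")                         # legend title

toy_plot
```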

As mentioned in the ggplot tips page, you can create your own palettes using colorRampPalette() on a vector of colors and specifying the number of colors you want in return. This is a good way to get many colors in a ramp by specifying a few.

my_cols <- c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange")
my_palette <- colorRampPalette(my_cols)(12)  # expand the 6 colors above to 12 colors
my_palette
##  [1] "#006400" "#00363F" "#00097E" "#3A0BAF" "#821ADD" "#A84BE2" "#B592CB" "#C9C99B" "#E7E745" "#FFF600" "#FFCD00" "#FFA500"

Adjust level order

To adjust the order of group appearance (on plot and in legend), the grouping column must be class Factor. See the page on Factors for more information.

First, let’s see a weekly epicurve by hospital with the default ordering:

# ORIGINAL - hospital NOT as factor
###################################

# create weekly incidence object, rows grouped by hospital and week
hospital_outbreak <- incidence(
  linelist,
  date_index = date_onset, 
  interval = "week", 
  groups = hospital)

# plot incidence object
plot(hospital_outbreak, fill = hospital, title = "ORIGINAL - hospital not a factor")

Now, to adjust the order so that “Missing” and “Other” are at the top of the epicurve we can do the following:

  • Load the package forcats, to work with factors
  • Adjust the dataset - in this case we’ll define a new dataset (plot_data) in which:
    • the hospital column is defined as a factor
    • the order of levels is set with fct_relevel() so that “Missing” and “Other” are first, so they appear at the top of the bars
  • The incidence object is created and plotted as before
  • We add ggplot2 modifications
    • scale_fill_manual() to manually assign colors so that “Missing” is grey and “Other” is beige

# MODIFIED - hospital as factor
###############################

# load forcats package for working with factors
pacman::p_load(forcats)

# Convert hospital column to factor and adjust levels
plot_data <- linelist %>% 
  mutate(hospital = fct_relevel(hospital, c("Missing", "Other"))) # Set "Missing" and "Other" as top levels


# Create weekly incidence object, grouped by hospital and week
hospital_outbreak_mod <- incidence(
  plot_data,
  date_index = date_onset, 
  interval = "week", 
  groups = hospital)

# plot incidence object
plot(hospital_outbreak_mod, fill = hospital)+
  
  # manual specify colors
  scale_fill_manual(values = c("grey", "beige", "darkgreen", "green2", "orange", "red", "pink"))+                      

  # labels added via ggplot
  labs(
      title = "MODIFIED - hospital as factor",   # plot title
      subtitle = "Other & Missing at top of epicurve",
      y = "Weekly case incidence",               # y axis title  
      x = "Week of symptom onset",               # x axis title
      fill = "Hospital")                         # title of legend     

TIP: If you want to reverse the order of the legend only, add this ggplot2 command guides(fill = guide_legend(reverse = TRUE)).
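Before plotting, it can help to confirm the level order that fct_relevel() produces. A minimal sketch on a short invented vector (the values in hospital here are made up for illustration):

```r
library(forcats)

# hypothetical character vector of hospital names
hospital <- c("Port Hospital", "Missing", "Central Hospital", "Other", "Port Hospital")

# default factor levels are alphabetical
levels(factor(hospital))
# "Central Hospital" "Missing" "Other" "Port Hospital"

# move "Missing" and "Other" to the front; remaining levels keep their order
hospital_fct <- fct_relevel(hospital, "Missing", "Other")
levels(hospital_fct)
# "Missing" "Other" "Central Hospital" "Port Hospital"
```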

Vertical gridlines

If you plot with default incidence2 settings, you may notice that the vertical gridlines appear at each date label and once between each date label. This can result in gridlines intersecting with the top of some bars.

You can remove all gridlines by adding the ggplot2 command theme_classic().

# make incidence object
a <- incidence(
  central_data,
  date_index = date_onset,
  interval = "Monday weeks"
)

# Default gridlines
plot(a, title = "Default lines")

# Specified gridline intervals
# NOT WORKING WITH INCIDENCE2 1.0.0
# plot(a, title = "Weekly lines")+
#   scale_x_date(
#     date_breaks = "4 weeks",      # major vertical lines align on weeks
#     date_minor_breaks = "weeks",  # minor vertical lines every week
#     date_labels = "%a\n%d\n%b")   # format of date labels

# No gridlines
plot(a, title = "No lines")+
  theme_classic()                 # remove all gridlines

Note however, that if using weeks, the date_breaks and date_minor_breaks arguments only work for Monday weeks. If your weeks are by another day of the week you will need to manually provide a vector of dates to the breaks = and minor_breaks = arguments instead. See the ggplot2 section for examples of this using seq.Date().
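As a hedged sketch of building such a vector, the code below creates weekly breaks that all fall on Sundays with seq.Date(). The start and end dates are arbitrary examples; in practice you would derive them from your own data.

```r
# hypothetical date range; 2014-04-06 is a Sunday
first_sunday <- as.Date("2014-04-06")
last_date    <- as.Date("2014-05-04")

# one date every 7 days, starting on the first Sunday
sunday_breaks <- seq.Date(from = first_sunday, to = last_date, by = "week")
sunday_breaks

# check that every break is a Sunday ("%w" returns "0" for Sunday)
all(format(sunday_breaks, "%w") == "0")

# the vector could then be supplied to a ggplot2 date scale, for example:
# scale_x_date(breaks = sunday_breaks, date_labels = "%d %b")
```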

Cumulative incidence

You can easily produce a plot of cumulative incidence by passing the incidence object to the incidence2 command cumulate() and then to plot(). This also works with facet_plot().

# make weekly incidence object
wkly_inci <- incidence(
  linelist,
  date_index = date_onset,
  interval = "week"
)
## 256 missing observations were removed.
# plot cumulative incidence
wkly_inci %>% 
  cumulate() %>% 
  plot()

See the section farther down on this page for an alternative method to plot cumulative incidence with ggplot2, for example to overlay a cumulative incidence line over an epicurve.

Rolling average

You can add a rolling average to an incidence2 plot easily with add_rolling_average() from the i2extras package. Pass your incidence2 object to this function, and then to plot(). Set before = as the number of previous time intervals (days, if your interval is daily) you want included in the rolling average (default is 2). If your data are grouped, the rolling average will be calculated per group.

rolling_avg <- incidence(                    # make incidence object
  linelist,
  date_index = date_onset,
  interval = "week",
  groups = gender) %>% 
  
  i2extras::add_rolling_average(before = 6)  # add rolling averages (in this case, by gender)

# plot
plot(rolling_avg, n.breaks = 3) # faceted automatically because rolling average on groups

To learn how to apply rolling averages more generally on data, see the Handbook page on Moving averages.

32.3 Epicurves with ggplot2

Using ggplot() to build your epicurve allows for more flexibility and customization, but requires more effort and understanding of how ggplot() works.

Unlike using the incidence2 package, you must manually control the aggregation of the cases by time (into weeks, months, etc) and the intervals of the labels on the date axis. This must be carefully managed.

These examples use a subset of the linelist dataset - only the cases from Central Hospital.

central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

To produce an epicurve with ggplot() there are three main elements:

  • A histogram, with linelist cases aggregated into “bins” distinguished by specific “break” points
  • Scales for the axes and their labels
  • Themes for the plot appearance, including titles, labels, captions, etc.

Specify case bins

Here we show how to specify how cases will be aggregated into histogram bins (“bars”). It is important to recognize that the intervals used to aggregate cases into histogram bins are not necessarily the same as the intervals of the date labels that appear on the x-axis.

Below is perhaps the simplest code to produce daily and weekly epicurves.

In the over-arching ggplot() command the dataset is provided to data =. Onto this foundation, the geometry of a histogram is added with a +. Within the geom_histogram(), we map the aesthetics such that the column date_onset is mapped to the x-axis. Also within the geom_histogram() but not within aes() we set the binwidth = of the histogram bins, in days. If this ggplot2 syntax is confusing, review the page on ggplot basics.

CAUTION: Plotting weekly cases by using binwidth = 7 starts the first 7-day bin at the first case, which could be any day of the week! To create specific weeks, see the section below.

# daily 
ggplot(data = central_data) +          # set data
  geom_histogram(                      # add histogram
    mapping = aes(x = date_onset),     # map date column to x-axis
    binwidth = 1)+                     # cases binned by 1 day 
  labs(title = "Central Hospital - Daily")                # title

# weekly
ggplot(data = central_data) +          # set data 
  geom_histogram(                      # add histogram
      mapping = aes(x = date_onset),   # map date column to x-axis
      binwidth = 7)+                   # cases binned every 7 days, starting from first case (!) 
  labs(title = "Central Hospital - 7-day bins, starting at first case") # title

Let us note that the first case in this Central Hospital dataset had symptom onset on:

format(min(central_data$date_onset, na.rm=T), "%A %d %b, %Y")
## [1] "Thursday 01 May, 2014"

To manually specify the histogram bin breaks, do not use the binwidth = argument, and instead supply a vector of dates to breaks =.

Create the vector of dates with the base R function seq.Date(). This function expects the arguments from =, to =, and by =. For example, the command below returns monthly dates starting on 1 February 2014 and ending no later than 15 July 2015.

monthly_breaks <- seq.Date(from = as.Date("2014-02-01"),
                           to = as.Date("2015-07-15"),
                           by = "months")

monthly_breaks   # print
##  [1] "2014-02-01" "2014-03-01" "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" "2014-08-01" "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01"
## [13] "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" "2015-06-01" "2015-07-01"

This vector can be provided to geom_histogram() as breaks =:

# monthly 
ggplot(data = central_data) +  
  geom_histogram(
    mapping = aes(x = date_onset),
    breaks = monthly_breaks)+         # provide the pre-defined vector of breaks                    
  labs(title = "Monthly case bins")   # title

A simple weekly date sequence can be returned by setting by = "week". For example:

weekly_breaks <- seq.Date(from = as.Date("2014-02-01"),
                          to = as.Date("2015-07-15"),
                          by = "week")
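Note that a fixed start date determines the weekday on which every break falls. As a quick check (a minimal sketch using only base R), the sequence above begins on 1 February 2014, which is a Saturday, so every weekly break lands on a Saturday rather than a Monday:

```r
# weekly sequence with a fixed start date (same call as above)
weekly_breaks <- seq.Date(from = as.Date("2014-02-01"),
                          to = as.Date("2015-07-15"),
                          by = "week")

# ISO weekday number of each break (1 = Monday ... 7 = Sunday); does not depend on locale
unique(format(weekly_breaks, "%u"))
## [1] "6"
```

This is one reason the dynamic approach described next, which anchors the sequence to a chosen weekday, is often preferable.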

An alternative to supplying specific start and end dates is to write dynamic code so that weekly bins begin the Monday before the first case. We will use these date vectors throughout the examples below.

# Sequence of weekly Monday dates for CENTRAL HOSPITAL
weekly_breaks_central <- seq.Date(
  from = floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1), # monday before first case
  to   = ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1), # monday after last case
  by   = "week")

Let’s unpack the rather daunting code above:

  • The “from” value (earliest date of the sequence) is created as follows: the minimum date value (min() with na.rm=TRUE) in the column date_onset is fed to floor_date() from the lubridate package. floor_date() set to “week” returns the start date of that case’s “week”, given that the start day of each week is a Monday (week_start = 1).
  • Likewise, the “to” value (end date of the sequence) is created using the counterpart function ceiling_date() to return the Monday after the last case.
  • The “by” argument of seq.Date() can be set to any number of days, weeks, or months.
  • Use week_start = 7 for Sunday weeks.
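To make the rounding concrete, here is a small sketch applied to a single date, 1 May 2014 (a Thursday, the first onset date at Central Hospital printed earlier). The object names are hypothetical and used only for illustration.

```r
library(lubridate)

first_case <- as.Date("2014-05-01")   # a Thursday

monday_before <- floor_date(first_case,   "week", week_start = 1)  # 2014-04-28
sunday_before <- floor_date(first_case,   "week", week_start = 7)  # 2014-04-27
monday_after  <- ceiling_date(first_case, "week", week_start = 1)  # 2014-05-05

monday_before
## [1] "2014-04-28"
```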

As we will use these date vectors throughout this page, we also define one for the whole outbreak (the above is for Central Hospital only).

# Sequence for the entire outbreak
weekly_breaks_all <- seq.Date(
  from = floor_date(min(linelist$date_onset, na.rm=T),   "week", week_start = 1), # monday before first case
  to   = ceiling_date(max(linelist$date_onset, na.rm=T), "week", week_start = 1), # monday after last case
  by   = "week")

These seq.Date() outputs can be used to create histogram bin breaks, but also the breaks for the date labels, which may be independent from the bins. Read more about the date labels in later sections.

TIP: For a simpler ggplot() command, save the bin breaks and date label breaks as named vectors in advance, and simply provide their names to breaks =.
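A minimal sketch of this tip follows. It uses a small simulated dataset (toy) rather than the linelist, and the names bin_breaks, label_breaks, and p_breaks are hypothetical. The label breaks are taken from the bin breaks so that labels fall on bar edges.

```r
library(ggplot2)
library(lubridate)

# toy onset dates (simulated, for illustration only)
set.seed(1)
toy <- data.frame(date_onset = as.Date("2014-05-01") + sample(0:90, 200, replace = TRUE))

# bin breaks: weekly on Mondays, from the Monday before the first case
bin_breaks <- seq.Date(
  from = floor_date(min(toy$date_onset),   "week", week_start = 1),
  to   = ceiling_date(max(toy$date_onset), "week", week_start = 1),
  by   = "week")

# label breaks: every 4th Monday, drawn from the same sequence
label_breaks <- bin_breaks[seq(1, length(bin_breaks), by = 4)]

# the plotting command now only refers to the saved vectors by name
p_breaks <- ggplot(data = toy) +
  geom_histogram(mapping = aes(x = date_onset), breaks = bin_breaks) +
  scale_x_date(breaks = label_breaks, minor_breaks = bin_breaks, date_labels = "%d %b")
```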

Weekly epicurve example

Below is detailed example code to produce weekly epicurves for Monday weeks, with aligned bars, date labels, and vertical gridlines. This section is for the user who needs code quickly. To understand each aspect (themes, date labels, etc.) in-depth, continue to the subsequent sections. Of note:

  • The histogram bin breaks are defined with seq.Date() as explained above to begin the Monday before the earliest case and to end the Monday after the last case
  • The interval of date labels is specified by date_breaks = within scale_x_date()
  • The interval of minor vertical gridlines between date labels is specified by date_minor_breaks =
  • expand = c(0,0) in the x and y scales removes excess space on each side of the axes, which also ensures the date labels begin from the first bar.

# TOTAL MONDAY WEEK ALIGNMENT
#############################
# Define sequence of weekly breaks
weekly_breaks_central <- seq.Date(
      from = floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1), # Monday before first case
      to   = ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1), # Monday after last case
      by   = "week")    # bins are 7-days 


ggplot(data = central_data) + 
  
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    
    # mapping aesthetics
    mapping = aes(x = date_onset),  # date column mapped to x-axis
    
    # histogram bin breaks
    breaks = weekly_breaks_central, # histogram bin breaks defined previously
    
    # bars
    color = "darkblue",     # color of lines around bars
    fill = "lightblue"      # color of fill within bars
  )+ 
    
  # x-axis labels
  scale_x_date(
    expand            = c(0,0),           # remove excess x-axis space before and after case bars
    date_breaks       = "4 weeks",        # date labels and major vertical gridlines appear every 4 Monday weeks
    date_minor_breaks = "week",           # minor vertical lines appear every Monday week
    date_labels       = "%a\n%d %b\n%Y")+ # date labels format
  
  # y-axis
  scale_y_continuous(
    expand = c(0,0))+             # remove excess y-axis space below 0 (align histogram flush with x-axis)
  
  # aesthetic themes
  theme_minimal()+                # simplify plot background
  
  theme(
    plot.caption = element_text(hjust = 0,        # caption on left side
                                face = "italic"), # caption in italics
    axis.title = element_text(face = "bold"))+    # axis titles in bold
  
  # labels including dynamic caption
  labs(
    title    = "Weekly incidence of cases (Monday weeks)",
    subtitle = "Note alignment of bars, vertical gridlines, and axis labels on Monday weeks",
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Sunday weeks

To achieve the above plot for Sunday weeks, a few modifications are needed, because date_breaks = "weeks" only works for Monday weeks.

  • The break points of the histogram bins must be set to Sundays (week_start = 7)
  • Within scale_x_date(), similar sequences of Sunday dates should be provided to breaks = and minor_breaks = to ensure the date labels and vertical gridlines align on Sundays.

For example, the scale_x_date() command for Sunday weeks could look like this:

scale_x_date(
    expand = c(0,0),
    
    # specify interval of date labels and major vertical gridlines
    breaks = seq.Date(
      from = floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7), # Sunday before first case
      to   = ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7), # Sunday after last case
      by   = "4 weeks"),
    
    # specify interval of minor vertical gridline 
    minor_breaks = seq.Date(
      from = floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7), # Sunday before first case
      to   = ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7), # Sunday after last case
      by   = "week"),
   
    # date label format
    date_labels = "%a\n%d %b\n%Y")+         # weekday abbrev., above day and month abbrev., above 4-digit year

Group/color by value

The histogram bars can be colored by group and “stacked”. To designate the grouping column, make the following changes. See the ggplot basics page for details.

  • Within the histogram aesthetic mapping aes(), map the column name to the group = and fill = arguments
  • Remove any fill = argument outside of aes(), as it will override the one inside
  • Arguments inside aes() will apply by group, whereas any outside will apply to all bars (e.g. you may still want color = outside, so each bar has the same border)

Here is what the aes() command would look like to group and color the bars by gender:

aes(x = date_onset, group = gender, fill = gender)

Here it is applied:

ggplot(data = linelist) +     # begin with linelist (many hospitals)
  
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    mapping = aes(
      x = date_onset,
      group = hospital,       # set data to be grouped by hospital
      fill = hospital),       # bar fill (inside color) by hospital
    
    # bin breaks are Monday weeks
    breaks = weekly_breaks_all,   # sequence of weekly Monday bin breaks for whole outbreak, defined in previous code       
    
    # Color around bars
    color = "black")

Adjust colors

  • To manually set the fill for each group, use scale_fill_manual() (note: scale_color_manual() is different!).
    • Use the values = argument to apply a vector of colors.
    • Use na.value = to specify a color for NA values.
    • Use the labels = argument to change the text of legend items. To be safe, provide as a named vector like c("old" = "new", "old" = "new") or adjust the values in the data itself.
    • Use name = to give a proper title to the legend
  • For more tips on color scales and palettes, see the page on ggplot basics.

ggplot(data = linelist)+           # begin with linelist (many hospitals)
  
  # make histogram
  geom_histogram(
    mapping = aes(x = date_onset,
        group = hospital,          # cases grouped by hospital
        fill = hospital),          # bar fill by hospital
    
    # bin breaks
    breaks = weekly_breaks_all,        # sequence of weekly Monday bin breaks, defined in previous code
    
    # Color around bars
    color = "black")+              # border color of each bar
  
  # manual specification of colors
  scale_fill_manual(
    values = c("black", "orange", "grey", "beige", "blue", "brown"),    # specify fill colors ("values") - attention to order!
    labels = c("St. Mark's Maternity Hospital (SMMH)" = "St. Mark's"),  # change text of one legend item
    name = "Hospital")                                                  # title of legend

Adjust level order

The order in which grouped bars are stacked is best adjusted by converting the grouping column to class factor. You can then designate the factor level order (and their display labels). See the page on Factors or ggplot tips for details.

Before making the plot, use the fct_relevel() function from the forcats package to convert the grouping column to class factor and manually adjust the level order, as detailed in the page on Factors.

# load forcats package for working with factors
pacman::p_load(forcats)

# Define new dataset with hospital as factor
plot_data <- linelist %>% 
  mutate(hospital = fct_relevel(hospital, c("Missing", "Other"))) # Convert to factor and set "Missing" and "Other" as top levels to appear on epicurve top

levels(plot_data$hospital) # print levels in order
## [1] "Missing"                              "Other"                                "Central Hospital"                     "Military Hospital"                   
## [5] "Port Hospital"                        "St. Mark's Maternity Hospital (SMMH)"

In the plot below, the only difference from the previous plot is that the column hospital has been re-leveled as above, so that “Missing” and “Other” are stacked at the top of the bars. The fill colors given to scale_fill_manual() are re-ordered to match the new level order.

ggplot(plot_data) +                     # Use NEW dataset with hospital as re-ordered factor
  
  # make histogram
  geom_histogram(
    mapping = aes(x = date_onset,
        group = hospital,               # cases grouped by hospital
        fill = hospital),               # bar fill (color) by hospital
    
    breaks = weekly_breaks_all,         # sequence of weekly Monday bin breaks for whole outbreak, defined at top of ggplot section
    
    color = "black")+                   # border color around each bar
    
  # x-axis labels
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space before and after case bars
    date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
    date_minor_breaks = "week",         # vertical lines appear every Monday week
    date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(
    expand = c(0,0))+                   # remove excess y-axis space below 0
  
  # manual specification of colors, ! attention to order
  scale_fill_manual(
    values = c("grey", "beige", "black", "orange", "blue", "brown"),
    labels = c("St. Mark's Maternity Hospital (SMMH)" = "St. Mark's"),
    name = "Hospital")+ 
  
  # aesthetic themes
  theme_minimal()+                      # simplify plot background
  
  theme(
    plot.caption = element_text(face = "italic", # caption on left side in italics
                                hjust = 0), 
    axis.title = element_text(face = "bold"))+   # axis titles in bold
  
  # labels
  labs(
    title    = "Weekly incidence of cases by hospital",
    subtitle = "Hospital as re-ordered factor",
    x        = "Week of symptom onset",
    y        = "Weekly cases")

TIP: To reverse the order of the legend only, add this ggplot2 command: guides(fill = guide_legend(reverse = TRUE)).

Adjust legend

Read more about legends and scales in the ggplot tips page. Here are a few highlights:

  • Edit the legend title either in the scale function or with labs(fill = "Legend title") (if you are using the color = aesthetic, then use labs(color = "Legend title"))
  • theme(legend.title = element_blank()) to have no legend title
  • theme(legend.position = "top") (“bottom”, “left”, “right”, or “none” to remove the legend)
  • theme(legend.direction = "horizontal") horizontal legend
  • guides(fill = guide_legend(reverse = TRUE)) to reverse order of the legend
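As a rough sketch, several of these legend adjustments can be combined in one plot. The example uses a tiny made-up dataset (toy) and a hypothetical object name (p_legend); it is not drawn from the linelist.

```r
library(ggplot2)

# toy data (made up, for illustration only)
toy <- data.frame(
  date_onset = as.Date("2014-05-01") + c(0, 3, 8, 10, 15, 17, 21, 25),
  gender     = c("f", "m", "f", "f", "m", "m", "f", "m"))

p_legend <- ggplot(data = toy) +
  geom_histogram(mapping = aes(x = date_onset, fill = gender), binwidth = 7) +
  labs(fill = "Gender") +                            # legend title
  guides(fill = guide_legend(reverse = TRUE)) +      # reverse order of legend items
  theme(legend.position  = "top",                    # legend above the plot
        legend.direction = "horizontal")             # legend items side by side
```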

Bars side-by-side

Side-by-side display of group bars (as opposed to stacked) is specified within geom_histogram() with position = "dodge" placed outside of aes().

If there are more than two value groups, these can become difficult to read. Consider instead using a faceted plot (small multiples). To improve readability in this example, missing gender values are removed.

ggplot(central_data %>% drop_na(gender))+   # begin with Central Hospital cases dropping missing gender
    geom_histogram(
        mapping = aes(
          x = date_onset,
          group = gender,         # cases grouped by gender
          fill = gender),         # bars filled by gender
        
        # histogram bin breaks
        breaks = weekly_breaks_central,   # sequence of weekly dates for Central outbreak - defined at top of ggplot section
        
        color = "black",          # bar edge color
        
        position = "dodge")+      # SIDE-BY-SIDE bars
                      
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space before and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+             # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(values = c("brown", "orange"),  # specify fill colors ("values") - attention to order!
                    na.value = "grey" )+     

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases, by gender",
       subtitle = "Subtitle",
       fill     = "Gender",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported")

Axis limits

There are two ways to limit the extent of axis values.

Generally the preferred way is to use the command coord_cartesian(), which accepts xlim = c(min, max) and ylim = c(min, max) (where you provide the min and max values). This acts as a “zoom” without actually removing any data, which is important for statistics and summary measures.
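The “zoom” behaviour can be checked with a small sketch on simulated data (toy and p_zoom are hypothetical names). The view is restricted to June, yet every case is still counted in the underlying bins:

```r
library(ggplot2)

# toy onset dates (simulated, for illustration only)
set.seed(1)
toy <- data.frame(date_onset = as.Date("2014-05-01") + sample(0:90, 200, replace = TRUE))

p_zoom <- ggplot(data = toy) +
  geom_histogram(mapping = aes(x = date_onset), binwidth = 7) +
  coord_cartesian(xlim = c(as.Date("2014-06-01"), as.Date("2014-07-01")))  # show June only

# total of the bar heights computed by ggplot2: all 200 cases remain binned
sum(ggplot_build(p_zoom)$data[[1]]$count)
```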

Alternatively, you can set maximum and minimum date values using limits = c() within scale_x_date(). For example:

scale_x_date(limits = c(as.Date("2014-04-01"), NA)) # sets a minimum date but leaves the maximum open.  

Likewise, if you want the x-axis to extend to a specific date (e.g. the current date), even if no new cases have been reported, you can use:

scale_x_date(limits = c(NA, Sys.Date())) # ensures date axis will extend until current date  

DANGER: Be cautious when setting the y-axis scale breaks or limits (e.g. 0 to 30 by 5: seq(0, 30, 5)). Such static numbers can cut off your plot if the data change to exceed the limit!
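One possible way to avoid static numbers is to derive the breaks from the data. The sketch below uses made-up weekly counts and hypothetical object names; it rounds the maximum count up to the next multiple of 5.

```r
# made-up weekly case counts (for illustration only)
weekly_counts <- c(3, 12, 27, 34, 18, 6)

# upper limit: maximum count rounded up to the next multiple of 5
y_max <- ceiling(max(weekly_counts) / 5) * 5

# breaks that always span the data, instead of a hard-coded seq(0, 30, 5)
y_breaks <- seq(0, y_max, by = 5)
y_breaks
## [1]  0  5 10 15 20 25 30 35
```

The resulting vector could then be supplied to scale_y_continuous(breaks = y_breaks), so the axis adapts when the data are updated.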

Date-axis labels/gridlines

TIP: Remember that date-axis labels are independent from the aggregation of the data into bars, but visually it can be important to align bins, date labels, and vertical grid lines.

To modify the date labels and grid lines, use scale_x_date() in one of these ways:

  • If your histogram bins are days, Monday weeks, months, or years:
    • Use date_breaks = to specify the interval of labels and major gridlines (e.g. “day”, “week”, “3 weeks”, “month”, or “year”)
    • Use date_minor_breaks = to specify interval of minor vertical gridlines (between date labels)
    • Add expand = c(0,0) to begin the labels at the first bar
    • Use date_labels = to specify format of date labels - see the Dates page for tips (use \n for a new line)
  • If your histogram bins are Sunday weeks:
    • Use breaks = and minor_breaks = by providing a sequence of date breaks for each
    • You can still use date_labels = and expand = for formatting as described above

Some notes:

  • See the opening ggplot section for instructions on how to create a sequence of dates using seq.Date().
  • See this page or the Working with dates page for tips on creating date labels.
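To preview what a date_labels = format will produce, the same format codes can be tried with base R format() on a single date. This is only a quick sketch; the date is arbitrary.

```r
d <- as.Date("2014-05-01")

format(d, "%d/%m")          # day and month as numbers
## [1] "01/05"
format(d, "%d\n%m\n%Y")     # \n stacks the parts on separate lines in the axis label
## [1] "01\n05\n2014"
format(d, "%a %d %b '%y")   # %a and %b (weekday and month abbreviations) depend on your locale
```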

Demonstrations

Below is a demonstration of plots where the bins and the plot labels/grid lines are aligned and not aligned:

# 7-day bins + Monday labels
#############################
ggplot(central_data) +
  geom_histogram(
    mapping = aes(x = date_onset),
    binwidth = 7,                 # 7-day bins with start at first case
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),               # remove excess x-axis space before and after case bars
    date_breaks = "3 weeks",       # Monday every 3 weeks
    date_minor_breaks = "week",    # Monday weeks
    date_labels = "%a\n%d\n%b\n'%y")+  # label format
  
  scale_y_continuous(
    expand = c(0,0))+              # remove excess space under x-axis, make flush
  
  labs(
    title = "MISALIGNED",
    subtitle = "! CAUTION: 7-day bars start Thursdays at first case\nDate labels and gridlines on Mondays\nNote how ticks don't align with bars")



# 7-day bins + Months
#####################
ggplot(central_data) +
  geom_histogram(
    mapping = aes(x = date_onset),
    binwidth = 7,
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),                  # remove excess x-axis space before and after case bars
    date_breaks = "months",           # 1st of month
    date_minor_breaks = "week",       # Monday weeks
    date_labels = "%a\n%d %b\n%Y")+    # label format
  
  scale_y_continuous(
    expand = c(0,0))+                # remove excess space under x-axis, make flush 
  
  labs(
    title = "MISALIGNED",
    subtitle = "! CAUTION: 7-day bars start Thursdays with first case\nMajor gridlines and date labels at 1st of each month\nMinor gridlines weekly on Mondays\nNote uneven spacing of some gridlines and ticks unaligned with bars")


# TOTAL MONDAY ALIGNMENT: specify manual bin breaks to be mondays
#################################################################
ggplot(central_data) + 
  geom_histogram(
    mapping = aes(x = date_onset),
    
    # histogram breaks set to 7 days beginning Monday before first case
    breaks = weekly_breaks_central,    # defined earlier in this page
    
    color = "darkblue",
    
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space before and after case bars
    date_breaks = "4 weeks",           # Monday every 4 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%a\n%d %b\n%Y")+      # label format
  
  scale_y_continuous(
    expand = c(0,0))+                # remove excess space under x-axis, make flush 
  
  labs(
    title = "ALIGNED Mondays",
    subtitle = "7-day bins manually set to begin Monday before first case (28 Apr)\nDate labels and gridlines on Mondays as well")


# TOTAL MONDAY ALIGNMENT WITH MONTHS LABELS:
############################################
ggplot(central_data) + 
  geom_histogram(
    mapping = aes(x = date_onset),
    
    # histogram breaks set to 7 days beginning Monday before first case
    breaks = weekly_breaks_central,            # defined earlier in this page
    
    color = "darkblue",
    
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space before and after case bars
    date_breaks = "months",            # date labels on 1st of each month
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%b\n%Y")+          # label format
  
  scale_y_continuous(
    expand = c(0,0))+                # remove excess space under x-axis, make flush 
  
  theme(panel.grid.major = element_blank())+  # Remove major gridlines (fall on 1st of month)
          
  labs(
    title = "ALIGNED Mondays with MONTHLY labels",
    subtitle = "7-day bins manually set to begin Monday before first case (28 Apr)\nDate labels on 1st of Month\nMonthly major gridlines removed")


# TOTAL SUNDAY ALIGNMENT: specify manual bin breaks AND labels to be Sundays
############################################################################
ggplot(central_data) + 
  geom_histogram(
    mapping = aes(x = date_onset),
    
    # histogram breaks set to 7 days beginning Sunday before first case
    breaks = seq.Date(from = floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7),
                      to   = ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7),
                      by   = "7 days"),
    
    color = "darkblue",
    
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),
    # date label breaks and major gridlines set to every 3 weeks beginning Sunday before first case
    breaks = seq.Date(from = floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7),
                      to   = ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7),
                      by   = "3 weeks"),
    
    # minor gridlines set to weekly beginning Sunday before first case
    minor_breaks = seq.Date(from = floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7),
                            to   = ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7),
                            by   = "7 days"),
    
    date_labels = "%a\n%d\n%b\n'%y")+  # label format
  
  scale_y_continuous(
    expand = c(0,0))+                # remove excess space under x-axis, make flush 
  
  labs(title = "ALIGNED Sundays",
       subtitle = "7-day bins manually set to begin Sunday before first case (27 Apr)\nDate labels and gridlines manually set to Sundays as well")

Aggregated data

Often instead of a linelist, you begin with aggregated counts from facilities, districts, etc. You can make an epicurve with ggplot() but the code will be slightly different. This section will utilize the count_data dataset that was imported earlier, in the data preparation section. This dataset is the linelist aggregated to day-hospital counts. The first 50 rows are displayed below.

Plotting daily counts

We can plot a daily epicurve from these daily counts. Here are the differences to the code:

  • Within the aesthetic mapping aes(), specify y = as the counts column (in this case, the column name is n_cases)
  • Add the argument stat = "identity" within geom_histogram(), which specifies that bar height should be the y = value, not the number of rows as is the default
  • Add the argument width = to avoid vertical white lines between the bars. For daily data, set it to 1. For weekly count data, set it to 7. For monthly count data, white lines are an issue (each month has a different number of days); consider transforming your x-axis to a categorical ordered factor (months) and using geom_col().

ggplot(data = count_data)+
  geom_histogram(
   mapping = aes(x = date_hospitalisation, y = n_cases),
   stat = "identity",
   width = 1)+                # for daily counts, set width = 1 to avoid white space between bars
  labs(
    x = "Date of report", 
    y = "Number of cases",
    title = "Daily case incidence, from daily count data")
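For the monthly case mentioned above, one possible approach is sketched here with made-up monthly counts (monthly_counts, month_label, and p_monthly are hypothetical names). The month is converted to a factor whose levels are in chronological order, so geom_col() draws one equal-width bar per month.

```r
library(ggplot2)

# made-up monthly counts (for illustration only)
monthly_counts <- data.frame(
  month   = as.Date(c("2014-05-01", "2014-06-01", "2014-07-01", "2014-08-01")),
  n_cases = c(12, 40, 73, 51))

# sort by date, then build a factor of "YYYY-MM" labels in chronological order
monthly_counts <- monthly_counts[order(monthly_counts$month), ]
month_labels <- format(monthly_counts$month, "%Y-%m")
monthly_counts$month_label <- factor(month_labels, levels = unique(month_labels))

p_monthly <- ggplot(data = monthly_counts) +
  geom_col(mapping = aes(x = month_label, y = n_cases)) +  # one equal-width bar per month
  labs(x = "Month of onset", y = "Monthly cases")
```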

Plotting weekly counts

If your data are already case counts by week, they might look like this dataset (called count_data_weekly):

The first 50 rows of count_data_weekly are displayed below. You can see that the counts have been aggregated into weeks. Each week is displayed by the first day of the week (Monday by default).

Now plot so that x = the epiweek column. Remember to add y = the counts column to the aesthetic mapping, and add stat = "identity" as explained above.

ggplot(data = count_data_weekly)+
  
  geom_histogram(
    mapping = aes(
      x = epiweek,           # x-axis is epiweek (as class Date)
      y = n_cases_weekly,    # y-axis height is the weekly case count
      group = hospital,      # we are grouping the bars and coloring by hospital
      fill = hospital),
    stat = "identity")+      # this is also required when plotting count data
     
  # labels for x-axis
  scale_x_date(
    date_breaks = "2 months",      # labels every 2 months 
    date_minor_breaks = "1 month", # gridlines every month
    date_labels = '%b\n%Y')+       #labeled by month with year below
     
  # Choose color palette (uses RColorBrewer package)
  scale_fill_brewer(palette = "Pastel2")+ 
  
  theme_minimal()+
  
  labs(
    x = "Week of onset", 
    y = "Weekly case incidence",
    fill = "Hospital",
    title = "Weekly case incidence, from aggregated count data by hospital")

Moving averages

See the page on Moving averages for a detailed description and several options. Below is one option for calculating moving averages with the package slider. In this approach, the moving average is calculated in the dataset prior to plotting:

  1. Aggregate the data into counts as necessary (daily, weekly, etc.) (see Grouping data page)
  2. Create a new column to hold the moving average, created with slide_index() from slider package
  3. Plot the moving average as a geom_line() on top of (after) the epicurve histogram

See the helpful online vignette for the slider package.

# load package
pacman::p_load(slider)  # slider used to calculate rolling averages

# make dataset of daily counts and 7-day moving average
#######################################################
ll_counts_7day <- linelist %>%    # begin with linelist
  
  ## count cases by date
  count(date_onset, name = "new_cases") %>%   # name new column with counts as "new_cases"
  drop_na(date_onset) %>%                     # remove cases with missing date_onset
  
  ## calculate the average number of cases in 7-day window
  mutate(
    avg_7day = slider::slide_index(    # create new column
      new_cases,                       # calculate based on value in new_cases column
      .i = date_onset,                 # index is date_onset col, so non-present dates are included in window 
      .f = ~mean(.x, na.rm = TRUE),    # function is mean() with missing values removed
      .before = 6,                     # window is the day and 6-days before
      .complete = FALSE),              # must be FALSE for unlist() to work in next step
    avg_7day = unlist(avg_7day))       # convert class list to class numeric
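To see what the date index changes, the sketch below compares a window defined by dates with a window defined by rows, using a made-up three-row data frame (toy_counts is hypothetical). It uses slide_index_dbl() and slide_dbl() from slider, which return a numeric vector directly and so are one alternative to slide_index() followed by unlist().

```r
library(slider)

# Hypothetical daily counts with a gap: no rows between 2 and 10 January
toy_counts <- data.frame(
  date_onset = as.Date(c("2024-01-01", "2024-01-02", "2024-01-10")),
  new_cases  = c(2, 4, 6)
)

# Window defined by dates: the day itself plus the 6 days before it
toy_counts$avg_by_date <- slide_index_dbl(
  .x = toy_counts$new_cases,
  .i = toy_counts$date_onset,
  .f = mean,
  .before = 6)

# Window defined by rows: the row itself plus up to 6 rows before it
toy_counts$avg_by_row <- slide_dbl(
  .x = toy_counts$new_cases,
  .f = mean,
  .before = 6)

toy_counts
```

For 10 January the date-based window holds only that day's row, while the row-based window averages all three rows. Note that dates with no row are simply absent from the window and are not counted as zero cases. If days without cases should pull the average down, one option is to add those days with a count of 0 first (for example with complete() from tidyr).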


# plot
######
ggplot(data = ll_counts_7day) +  # begin with new dataset defined above 
    geom_histogram(              # create epicurve histogram
      mapping = aes(
        x = date_onset,          # date column as x-axis
        y = new_cases),          # height is number of daily new cases
      stat = "identity",         # height is y value
      fill = "#92a8d1",          # cool color for bars
      colour = "#92a8d1") +      # same color for bar border
    geom_line(                   # make line for rolling average
      mapping = aes(
        x = date_onset,          # date column for x-axis
        y = avg_7day,            # y-value set to rolling average column
        lty = "7-day \nrolling avg"), # name of line in legend
      color="red",               # color of line
      size = 1) +                # width of line
    scale_x_date(                # date scale
      date_breaks = "1 month",
      date_labels = '%d/%m',
      expand = c(0,0)) +
    scale_y_continuous(          # y-axis scale
      expand = c(0,0),
      limits = c(0, NA)) +       
    labs(
      x="",
      y ="Number of confirmed cases",
      fill = "Legend")+ 
    theme_minimal()+
    theme(legend.title = element_blank())  # removes title of legend

Faceting/small-multiples

As with other ggplots, you can create facetted plots (“small multiples”). As explained in the ggplot tips page of this handbook, you can use either facet_wrap() or facet_grid(). Here we demonstrate with facet_wrap(). For epicurves, facet_wrap() is typically easier as it is likely that you only need to facet on one column.

The simplest syntax is a tilde (~) followed by the name of the column to facet on: facet_wrap(~age_cat). The two-sided form rows ~ cols, where the column to the left of the tilde is spread across the “rows” of the facetted plot and the column to the right is spread across the “columns”, applies to facet_grid(). facet_wrap() instead wraps the panels into the number of rows and columns set with nrow = and ncol =.

Free axes
You will need to decide whether the scales of the axes for each facet are “fixed” to the same dimensions (default), or “free” (meaning they will change based on the data within the facet). Do this with the scales = argument within facet_wrap() by specifying “free_x” or “free_y”, or “free”.

Number of cols and rows of facets
This can be specified with ncol = and nrow = within facet_wrap().

Order of panels
To change the order of appearance, change the underlying order of the levels of the factor column used to create the facets.
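A minimal sketch of these three options together, on made-up data (the toy data frame and its values are assumptions for illustration): the factor levels set the panel order, ncol = sets the layout, and scales = "free_y" lets each panel have its own y-axis range.

```r
library(ggplot2)

# Hypothetical onset dates in three age categories (illustrative only)
toy <- data.frame(
  date_onset = as.Date("2024-01-01") + c(0, 1, 8, 9, 15, 2, 3, 3, 10, 20, 21, 22),
  age_cat    = rep(c("5-9", "0-4", "10-14"), times = c(5, 4, 3))
)

# Set the factor levels explicitly: panels appear in this order
toy$age_cat <- factor(toy$age_cat, levels = c("0-4", "5-9", "10-14"))

p_facets <- ggplot(toy) +
  geom_histogram(aes(x = date_onset), binwidth = 7) +   # 7-day bins
  facet_wrap(
    ~age_cat,
    ncol = 1,              # one column of panels, stacked vertically
    scales = "free_y")     # each panel gets its own y-axis range

p_facets
```

Without the factor() step the panels would follow the default alpha-numeric order of the text values, which places "10-14" before "5-9".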

Aesthetics
Font size and face, strip color, etc. can be modified through theme() with arguments like:

  • strip.text = element_text() (size, colour, face, angle…)
  • strip.background = element_rect() (e.g. element_rect(fill = "grey"))

The position of the strip ("bottom", "top", "left", or "right") is set with the strip.position = argument of facet_wrap(), not within theme().

Strip labels
Labels of the facet plots can be modified through the “labels” of the column as a factor, or by the use of a “labeller”.

Make a labeller like this, using the function as_labeller() from ggplot2. Then provide the labeller to the labeller = argument of facet_wrap() as shown below.

my_labels <- as_labeller(c(
     "0-4"   = "Ages 0-4",
     "5-9"   = "Ages 5-9",
     "10-14" = "Ages 10-14",
     "15-19" = "Ages 15-19",
     "20-29" = "Ages 20-29",
     "30-49" = "Ages 30-49",
     "50-69" = "Ages 50-69",
     "70+"   = "Over age 70"))

An example facetted plot - facetted by column age_cat.

# make plot
###########
ggplot(central_data) + 
  
  geom_histogram(
    mapping = aes(
      x = date_onset,
      group = age_cat,
      fill = age_cat),    # arguments inside aes() apply by group
      
    color = "black",      # arguments outside aes() apply to all data
        
    # histogram breaks
    breaks = weekly_breaks_central)+  # pre-defined date vector (see earlier in this page)
                      
  # The labels on the x-axis
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space before and after case bars
    date_breaks       = "2 months",     # labels appear every 2 months
    date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
    date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                       # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                           # a set of themes to simplify plot
  theme(
    plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
    axis.title = element_text(face = "bold"),                # axis titles in bold
    legend.position = "bottom",                              # legend below the plot
    strip.text = element_text(face = "bold", size = 10),     # facet strip text in bold
    strip.background = element_rect(fill = "grey"))+         # grey facet strip background
  
  # create facets
  facet_wrap(
    ~age_cat,
    ncol = 4,
    strip.position = "top",
    labeller = my_labels)+             
  
  # labels
  labs(
    title    = "Weekly incidence of cases, by age category",
    subtitle = "Subtitle",
    fill     = "Age category",                                      # provide new title for legend
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

See this link for more information on labellers.

Total epidemic in facet background

To show the total epidemic in the background of each facet, add the function gghighlight() with empty parentheses to the ggplot. This is from the package gghighlight. Note that the y-axis maximum in all facets is now based on the peak of the entire epidemic. There are more examples of this package in the ggplot tips page.

ggplot(central_data) + 
  
  # epicurves by group
  geom_histogram(
    mapping = aes(
      x = date_onset,
      group = age_cat,
      fill = age_cat),  # arguments inside aes() apply by group
    
    color = "black",    # arguments outside aes() apply to all data
    
    # histogram breaks
    breaks = weekly_breaks_central)+     # pre-defined date vector (see top of ggplot section)                
  
  # add grey epidemic in background to each facet
  gghighlight::gghighlight()+
  
  # labels on x-axis
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space before and after case bars
    date_breaks       = "2 months",     # labels appear every 2 months
    date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
    date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+  # removes excess y-axis space below 0
  
  # aesthetic themes
  theme_minimal()+                                           # a set of themes to simplify plot
  theme(
    plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
    axis.title = element_text(face = "bold"),                # axis titles in bold
    legend.position = "bottom",                              # legend below the plot
    strip.text = element_text(face = "bold", size = 10),     # facet strip text in bold
    strip.background = element_rect(fill = "white"))+        # white facet strip background
  
  # create facets
  facet_wrap(
    ~age_cat,                          # each plot is one value of age_cat
    ncol = 4,                          # number of columns
    strip.position = "top",            # position of the facet title/strip
    labeller = my_labels)+             # labeller defined above
  
  # labels
  labs(
    title    = "Weekly incidence of cases, by age category",
    subtitle = "Subtitle",
    fill     = "Age category",                                      # provide new title for legend
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

One facet with data

If you want to have one facet box that contains all the data, duplicate the entire dataset and treat the duplicates as one faceting value. The “helper” function CreateAllFacet() below can assist with this (thanks to this blog post). When it is run, the number of rows doubles and there is a new column called facet, in which the duplicated rows have the value “all” and the original rows keep their original value of the faceting column. Now you just have to facet on the facet column.

Here is the helper function. Run it so that it is available to you.

# Define helper function
CreateAllFacet <- function(df, col){
     df$facet <- df[[col]]
     temp <- df
     temp$facet <- "all"
     merged <-rbind(temp, df)
     
     # ensure the original faceting column is a factor
     merged[[col]] <- as.factor(merged[[col]])
     
     return(merged)
}

Now apply the helper function to the dataset, on column age_cat:

# Create dataset that is duplicated and with new column "facet" to show "all" age categories as another facet level
central_data2 <- CreateAllFacet(central_data, col = "age_cat") %>%
  
  # set factor levels
  mutate(facet = fct_relevel(facet, "all", "0-4", "5-9",
                             "10-14", "15-19", "20-29",
                             "30-49", "50-69", "70+"))
## Warning: Unknown levels in `f`: 70+
# check levels
table(central_data2$facet, useNA = "always")
## 
##   all   0-4   5-9 10-14 15-19 20-29 30-49 50-69  <NA> 
##   454    84    84    82    58    73    57     7     9
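To check what the helper does, here is a small sketch on a made-up four-row data frame (toy and its values are hypothetical; the helper is repeated so the example runs on its own). The warning in the output above is presumably raised by fct_relevel() because no rows in central_data have the value "70+".

```r
# Same helper as above, repeated so this example is self-contained
CreateAllFacet <- function(df, col){
  df$facet <- df[[col]]          # copy the faceting column into "facet"
  temp <- df
  temp$facet <- "all"            # the duplicated rows all get the value "all"
  merged <- rbind(temp, df)      # duplicates first, then the original rows
  merged[[col]] <- as.factor(merged[[col]])
  return(merged)
}

# Hypothetical data: four cases, one with missing age category
toy <- data.frame(
  case_id = 1:4,
  age_cat = c("0-4", "0-4", "5-9", NA)
)

toy2 <- CreateAllFacet(toy, col = "age_cat")

nrow(toy2)                              # twice the number of rows in toy
table(toy2$facet, useNA = "always")     # counts per facet value, including NA
```

The "all" group contains every row, including the row with a missing age category, while that row also remains as NA in the original half of the data.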

Notable changes to the ggplot() command are:

  • The data used is now central_data2 (double the rows, with new column “facet”)
  • Labeller will need to be updated, if used
  • Optional: to stack the facets vertically, set ncol = 1 in facet_wrap(). The code below writes the formula as facet ~ . (the facet column to the left of the tilde and “.” to the right). You may also need to adjust the width and height of the saved png plot image (see ggsave() in ggplot tips).

ggplot(central_data2) + 
  
  # actual epicurves by group
  geom_histogram(
        mapping = aes(
          x = date_onset,
          group = age_cat,
          fill = age_cat),  # arguments inside aes() apply by group
        color = "black",    # arguments outside aes() apply to all data
        
        # histogram breaks
        breaks = weekly_breaks_central)+    # pre-defined date vector (see top of ggplot section)
                     
  # Labels on x-axis
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space before and after case bars
    date_breaks       = "2 months",     # labels appear every 2 months
    date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
    date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+  # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                           # a set of themes to simplify plot
  theme(
    plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
    axis.title = element_text(face = "bold"),
    legend.position = "bottom")+               
  
  # create facets
  facet_wrap(facet~. ,                            # each plot is one value of facet
             ncol = 1)+            

  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

32.4 Tentative data

The most recent data shown in epicurves should often be marked as tentative, or subject to reporting delays. This can be done by adding a vertical line and/or rectangle over a specified number of days. Here are two options:

  1. Use annotate():
    • For a line use annotate(geom = "segment"). Provide x, xend, y, and yend. Adjust size, linetype (lty), and color.
    • For a rectangle use annotate(geom = "rect"). Provide xmin/xmax/ymin/ymax. Adjust color and alpha.
  2. Group the data by tentative status and color those bars differently

CAUTION: You might try geom_rect() to draw a rectangle, but adjusting the transparency does not work as expected in a linelist context, because this function overlays one rectangle for each observation/row! Use either a very low alpha (e.g. 0.01), or another approach.

Using annotate()

  • Within annotate(geom = "rect"), the xmin and xmax arguments must be given inputs of class Date.
  • Note that because these data are aggregated into weekly bars, and the last bar extends to the Monday after the last data point, the shaded region may appear to cover 4 weeks.
  • Here is an annotate() online example

ggplot(central_data) + 
  
  # histogram
  geom_histogram(
    mapping = aes(x = date_onset),
    
    breaks = weekly_breaks_central,   # pre-defined date vector - see top of ggplot section
    
    color = "darkblue",
    
    fill = "lightblue") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space before and after case bars
    date_breaks = "1 month",           # 1st of month
    date_minor_breaks = "1 month",     # 1st of month
    date_labels = "%b\n'%y")+          # label format
  
  # labels and theme
  labs(
    title = "Using annotate()\nRectangle and line showing that data from last 21-days are tentative",
    x = "Week of symptom onset",
    y = "Weekly case incidence")+ 
  theme_minimal()+
  
  # add semi-transparent red rectangle to tentative data
  annotate(
    "rect",
    xmin  = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
    xmax  = as.Date(Inf),                                          # note must be wrapped in as.Date()
    ymin  = 0,
    ymax  = Inf,
    alpha = 0.2,          # alpha easy and intuitive to adjust using annotate()
    fill  = "red")+
  
  # add black vertical line on top of other layers
  annotate(
    "segment",
    x     = max(central_data$date_onset, na.rm = T) - 21, # 21 days before last data
    xend  = max(central_data$date_onset, na.rm = T) - 21, 
    y     = 0,         # line begins at y = 0
    yend  = Inf,       # line to top of plot
    size  = 2,         # line size
    color = "black",
    lty   = "solid")+   # linetype e.g. "solid", "dashed"

  # add text in rectangle
  annotate(
    "text",
    x = max(central_data$date_onset, na.rm = T) - 15,
    y = 15,
    label = "Subject to reporting delays",
    angle = 90)

The same black vertical line can be achieved with the code below, but using geom_vline() you lose the ability to control the height:

geom_vline(xintercept = max(central_data$date_onset, na.rm = T) - 21,
           size = 2,
           color = "black")
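Because the bars are weekly, one optional refinement is to compute the cutoff date once, store it, and move it back to the start of its week so the line and rectangle fall on a bar edge. The sketch below is an illustration in base R on made-up dates (onset_dates is hypothetical) and assumes weekly bins that begin on Mondays.

```r
# Hypothetical onset dates (illustrative only)
onset_dates <- as.Date(c("2024-02-05", "2024-03-01", "2024-03-20", NA))

# 21 days before the most recent onset date
cutoff <- max(onset_dates, na.rm = TRUE) - 21

# Step back to the Monday of that week
# format(x, "%u") gives the weekday as a number from 1 (Monday) to 7 (Sunday)
cutoff_monday <- cutoff - (as.integer(format(cutoff, "%u")) - 1)

cutoff
cutoff_monday
class(cutoff_monday)     # still class Date, as annotate() expects for x and xmin
```

The stored value could then be supplied to xmin =, x =, and xend = in annotate(), or to xintercept = in geom_vline(), instead of repeating the max() expression.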

Bar color

An alternative approach could be to adjust the color or display of the tentative bars of data themselves. You could create a new column in the data preparation stage and use it to group the data, such that the aes(fill = ) of tentative data can be a different color or alpha than the other bars.

# add column
############
plot_data <- central_data %>% 
  mutate(tentative = case_when(
    date_onset >= max(date_onset, na.rm=T) - 7 ~ "Tentative", # tentative if in last 7 days
    TRUE                                       ~ "Reliable")) # all else reliable

# plot
######
ggplot(plot_data, aes(x = date_onset, fill = tentative)) + 
  
  # histogram
  geom_histogram(
    breaks = weekly_breaks_central,   # pre-defined date vector, see top of ggplot page
    color = "black") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_fill_manual(values = c("lightblue", "grey"))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space before and after case bars
    date_breaks = "3 weeks",           # Monday every 3 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%d\n%b\n'%y")+      # label format
  
  # labels and theme
  labs(title = "Show days that are tentative reporting",
    subtitle = "")+ 
  theme_minimal()+
  theme(legend.title = element_blank())                 # remove title of legend
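The classification step can be checked on a few made-up dates, as in the sketch below (toy and fill_values are hypothetical names). Unlike the code above, this variant keeps rows with a missing onset date as NA rather than labelling them "Reliable", and it names the fill colors so that they do not depend on the alphabetical order of the group labels.

```r
library(dplyr)

# Hypothetical onset dates, one missing (illustrative only)
toy <- tibble(date_onset = as.Date("2024-03-01") + c(0, 5, 10, 14, NA))

toy <- toy %>%
  mutate(tentative = case_when(
    is.na(date_onset)                             ~ NA_character_, # keep missing dates as NA
    date_onset >= max(date_onset, na.rm = TRUE) - 7 ~ "Tentative", # last 7 days of data
    TRUE                                          ~ "Reliable"))   # everything earlier

toy

# Named colors are matched to the group labels, whatever their order
fill_values <- c("Reliable" = "lightblue", "Tentative" = "grey")
```

The named vector can be passed as scale_fill_manual(values = fill_values).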

32.5 Multi-level date labels

If you want multi-level date labels (e.g. month and year) without duplicating the lower label levels, consider one of the approaches below:

Remember - you can use tools like \n within the date_labels or labels arguments to put parts of each label on a new line below. However, the code below helps you place years or months (for example) on a lower line and show each only once. A few notes on the code below:

  • Case counts are aggregated into weeks for aesthetic reasons. See Epicurves page (aggregated data tab) for details.
  • A geom_area() line is used instead of a histogram, as the faceting approach below does not work well with histograms.

Aggregate to weekly counts

# Create dataset of case counts by week
#######################################
central_weekly <- linelist %>%
  filter(hospital == "Central Hospital") %>%   # filter linelist
  mutate(week = lubridate::floor_date(date_onset, unit = "weeks")) %>%  
  count(week) %>%                              # summarize weekly case counts
  drop_na(week) %>%                            # remove the row counting cases with missing date_onset
  complete(                                    # fill-in all weeks with no cases reported
    week = seq.Date(
      from = min(week),   
      to   = max(week),
      by   = "week"),
    fill = list(n = 0))                        # convert new NA values to 0 counts
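A small sketch of the same pipeline on made-up dates (toy is hypothetical) shows what complete() adds: the week with no cases appears as a row with a count of 0. It assumes the lubridate default of weeks starting on Sunday.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical onset dates: two in one week, none the next week, one the week after
toy <- tibble(date_onset = as.Date(c("2024-01-08", "2024-01-09", "2024-01-23", NA)))

toy_weekly <- toy %>%
  mutate(week = floor_date(date_onset, unit = "week")) %>%  # week start date (Sunday by default)
  count(week) %>%                                           # cases per week
  drop_na(week) %>%                                         # drop the row for missing dates
  complete(                                                 # add weeks that have no rows
    week = seq.Date(
      from = min(week),
      to   = max(week),
      by   = "week"),
    fill = list(n = 0))                                     # new rows get a count of 0

toy_weekly
```

Without the complete() step, the week beginning 14 January would be missing from the data and a line or area plot would connect the neighbouring weeks directly.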

Make plots

# plot with box border on year
##############################
ggplot(central_weekly) +
  geom_area(aes(x = week, y = n),    # make line, specify x and y
            stat = "identity") +             # because line height is count number
  scale_x_date(date_labels="%b",             # date label format show month 
               date_breaks="month",          # date labels on 1st of each month
               expand=c(0,0)) +              # remove excess space on each end
  scale_y_continuous(
    expand  = c(0,0))+                       # remove excess space below x-axis
  facet_grid(~lubridate::year(week), # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",                # x-axes adapt to data range (not "fixed")
             switch="x") +                   # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",         # facet labels placement
        strip.background = element_rect(fill = NA, # facet labels no fill grey border
                                        colour = "grey50"),
        panel.spacing = unit(0, "cm"))+      # no space between facet panels
  labs(title = "Nested year labels, grey label border")

# plot with no box border on year
#################################
ggplot(central_weekly,
       aes(x = week, y = n)) +              # establish x and y for entire plot
  geom_line(stat = "identity",              # make line, line height is count number
            color = "#69b3a2") +            # line color
  geom_point(size=1, color="#69b3a2") +     # make points at the weekly data points
  geom_area(fill = "#69b3a2",               # fill area below line
            alpha = 0.4)+                   # fill transparency
  scale_x_date(date_labels="%b",            # date label format show month 
               date_breaks="month",         # date labels on 1st of each month
               expand=c(0,0)) +             # remove excess space
  scale_y_continuous(
    expand  = c(0,0))+                      # remove excess space below x-axis
  facet_grid(~lubridate::year(week),        # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",               # x-axes adapt to data range (not "fixed")
             switch="x") +                  # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",                   # facet label placement
        strip.background = element_blank(),            # no facet label background
        panel.grid.minor.x = element_blank(),          
        panel.border = element_rect(colour="grey40"),  # grey border to facet PANEL
        panel.spacing=unit(0,"cm"))+                   # No space between facet panels
  labs(title = "Nested year labels - points, shaded, no label border")

The above techniques were adapted from this and this post on stackoverflow.com.

32.6 Dual-axis

Although there are fierce discussions about the validity of dual axes within the data visualization community, many epi supervisors still want to see an epicurve or similar chart with a percent overlaid with a second axis. This is discussed more extensively in the ggplot tips page, but one example using the cowplot method is shown below:

  • Two distinct plots are made, and then combined with cowplot package.
  • The plots must have the exact same x-axis (set limits) or else the data and labels will not align
  • Each uses theme_cowplot() and one has the y-axis moved to the right side of the plot

# load package
pacman::p_load(cowplot)

# Make first plot of epicurve histogram
#######################################
plot_cases <- linelist %>% 
  
  # plot cases per week
  ggplot()+
  
  # create histogram  
  geom_histogram(
    
    mapping = aes(x = date_onset),
    
    # bin breaks every week beginning monday before first case, going to monday after last case
    breaks = weekly_breaks_all)+  # pre-defined vector of weekly dates (see top of ggplot section)
        
  # specify beginning and end of date axis to align with other plot
  scale_x_date(
    limits = c(min(weekly_breaks_all), max(weekly_breaks_all)))+  # min/max of the pre-defined weekly breaks of histogram
  
  # labels
  labs(
      y = "Weekly cases",
      x = "Date of symptom onset"
    )+
  theme_cowplot()


# make second plot of percent died per week
###########################################
plot_deaths <- linelist %>%                        # begin with linelist
  group_by(week = floor_date(date_onset, "week")) %>%  # create week column
  
  # summarise to get weekly percent of cases who died
  summarise(n_cases = n(),
            died = sum(outcome == "Death", na.rm=T),
            pct_died = 100*died/n_cases) %>% 
  
  # begin plot
  ggplot()+
  
  # line of weekly percent who died
  geom_line(                                # create line of percent died
    mapping = aes(x = week, y = pct_died),  # specify y-height as pct_died column
    stat = "identity",                      # line height is the value in the pct_died column (this is the default for geom_line())
    size = 2,
    color = "black")+
  
  # Same date-axis limits as the other plot - perfect alignment
  scale_x_date(
    limits = c(min(weekly_breaks_all), max(weekly_breaks_all)))+  # min/max of the pre-defined weekly breaks of histogram
  
  
  # y-axis adjustments
  scale_y_continuous(                # adjust y-axis
    breaks = seq(0,100, 10),         # set break intervals of percent axis
    limits = c(0, 100),              # set extent of percent axis
    position = "right")+             # move percent axis to the right
  
  # Y-axis label, no x-axis label
  labs(x = "",
       y = "Percent deceased")+      # percent axis label
  
  theme_cowplot()                   # add this to make the two plots merge together nicely
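One point worth checking in the summarise() step above is the denominator: n() counts every case in the week, including those with a missing outcome. The sketch below, on made-up data (toy and the column n_known are hypothetical), compares that with a denominator restricted to cases whose outcome is known.

```r
library(dplyr)

# Hypothetical cases: three in the first week (one with unknown outcome), one in the second
toy <- tibble(
  week    = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01", "2024-01-08")),
  outcome = c("Death", "Recover", NA, "Death")
)

toy_pct <- toy %>%
  group_by(week) %>%
  summarise(
    n_cases        = n(),                                  # all cases in the week
    n_known        = sum(!is.na(outcome)),                 # cases with a recorded outcome
    died           = sum(outcome == "Death", na.rm = TRUE),
    pct_died_all   = 100 * died / n_cases,                 # missing outcomes count as not died
    pct_died_known = 100 * died / n_known)                 # missing outcomes excluded

toy_pct
```

Which denominator is appropriate depends on the analysis; the point here is only that the two can differ when outcomes are missing.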

Now use cowplot to overlay the two plots. Attention has been paid to the x-axis alignment, side of the y-axis, and use of theme_cowplot().

aligned_plots <- cowplot::align_plots(plot_cases, plot_deaths, align="hv", axis="tblr")
ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])

32.7 Cumulative Incidence

Note: If using incidence2, see the section on how you can produce cumulative incidence with a simple function. This page will address how to calculate cumulative incidence and plot it with ggplot().

If beginning with a case linelist, create a new column containing the cumulative number of cases per day in an outbreak using cumsum() from base R:

cumulative_case_counts <- linelist %>% 
  count(date_onset) %>%                # count of rows per day (returned in column "n")   
  mutate(                         
    cumulative_cases = cumsum(n)       # new column of the cumulative number of rows at each date
    )
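Note that count() also returns a row for cases with a missing date, which it places last, so those cases are added to the final cumulative total. The sketch below on made-up dates (toy is hypothetical) shows the difference when rows with missing dates are removed first; whether to remove them is a judgment call for your analysis.

```r
library(dplyr)

# Hypothetical onset dates, one missing (illustrative only)
toy <- tibble(date_onset = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02",
                                     NA, "2024-01-04")))

# Missing dates kept: the NA row is last and is included in the running total
cumulative_with_na <- toy %>%
  count(date_onset) %>%
  mutate(cumulative_cases = cumsum(n))

# Missing dates removed before counting
cumulative_without_na <- toy %>%
  filter(!is.na(date_onset)) %>%
  count(date_onset) %>%
  mutate(cumulative_cases = cumsum(n))

cumulative_with_na
cumulative_without_na
```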

The first 10 rows are shown below:

This cumulative column can then be plotted against date_onset, using geom_line():

plot_cumulative <- ggplot()+
  geom_line(
    data = cumulative_case_counts,
    aes(x = date_onset, y = cumulative_cases),
    size = 2,
    color = "blue")

plot_cumulative

It can also be overlaid onto the epicurve, with dual-axis using the cowplot method described above and in the ggplot tips page:

#load package
pacman::p_load(cowplot)

# Make first plot of epicurve histogram
plot_cases <- ggplot()+
  geom_histogram(          
    data = linelist,
    aes(x = date_onset),
    binwidth = 1)+
  labs(
    y = "Daily cases",
    x = "Date of symptom onset"
  )+
  theme_cowplot()

# make second plot of cumulative cases line
plot_cumulative <- ggplot()+
  geom_line(
    data = cumulative_case_counts,
    aes(x = date_onset, y = cumulative_cases),
    size = 2,
    color = "blue")+
  scale_y_continuous(
    position = "right")+
  labs(x = "",
       y = "Cumulative cases")+
  theme_cowplot()+
  theme(
    axis.line.x = element_blank(),
    axis.text.x = element_blank(),
    axis.title.x = element_blank(),
    axis.ticks = element_blank())

Now use cowplot to overlay the two plots. Attention has been paid to the x-axis alignment, side of the y-axis, and use of theme_cowplot().

aligned_plots <- cowplot::align_plots(plot_cases, plot_cumulative, align="hv", axis="tblr")
ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])

32.8 Resources

33 Demographic pyramids and Likert-scales

Demographic pyramids are useful to show distributions of age and gender. Similar code can be used to visualize the results of Likert-style survey questions (e.g. “Strongly agree”, “Agree”, “Neutral”, “Disagree”, “Strongly disagree”). In this page we cover the following:

  • Fast & easy pyramids using the apyramid package
  • More customizable pyramids using ggplot()
  • Displaying “baseline” demographics in the background of the pyramid
  • Using pyramid-style plots to show other types of data (e.g. responses to Likert-style survey questions)

33.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(rio,       # to import data
               here,      # to locate files
               tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
               apyramid,  # a package dedicated to creating age pyramids
               janitor,   # tables and cleaning data
               stringr)   # working with strings for titles, captions, etc.

Import data

To begin, we import the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

Cleaning

To make a traditional age/gender demographic pyramid, the data must first be cleaned in the following ways:

  • The gender column must be cleaned.
  • Depending on your method, age should be stored as either a numeric or in an age category column.

If using age categories, the column values should be correctly ordered, either by default alpha-numeric order or intentionally set by converting to class factor.
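The ordering matters because text values sort character by character, so "10-14" comes before "5-9". A minimal base R sketch, using a made-up vector (age_cat here is hypothetical, not the linelist column):

```r
# Hypothetical age categories stored as text
age_cat <- c("5-9", "10-14", "0-4", "10-14", "5-9")

# Default alpha-numeric order puts "10-14" before "5-9"
sort(unique(age_cat))

# Convert to a factor with the levels in the intended order
age_cat_f <- factor(age_cat, levels = c("0-4", "5-9", "10-14"))

levels(age_cat_f)    # the order used for plot axes and tables
table(age_cat_f)     # counts appear in level order
```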

Below we use tabyl() from janitor to inspect the columns gender and age_cat5.

linelist %>% 
  tabyl(age_cat5, gender)
##  age_cat5   f   m NA_
##       0-4 640 416  39
##       5-9 641 412  42
##     10-14 518 383  40
##     15-19 359 364  20
##     20-24 305 316  17
##     25-29 163 259  13
##     30-34 104 213   9
##     35-39  42 157   3
##     40-44  25 107   1
##     45-49   8  80   5
##     50-54   2  37   1
##     55-59   0  30   0
##     60-64   0  12   0
##     65-69   0  12   1
##     70-74   0   4   0
##     75-79   0   0   1
##     80-84   0   1   0
##       85+   0   0   0
##      <NA>   0   0  86

We also run a quick histogram on the age column to ensure it is clean and correctly classified:

hist(linelist$age)

33.2 apyramid package

The package apyramid is a product of the R4Epis project. You can read more about this package here. It allows you to quickly make an age pyramid. For more nuanced situations, see the section below using ggplot(). You can read more about the apyramid package in its Help page by entering ?age_pyramid in your R console.

Linelist data

Using the cleaned linelist dataset, we can create an age pyramid with one simple age_pyramid() command. In this command:

  • The data = argument is set as the linelist data frame
  • The age_group = argument (for y-axis) is set to the name of the categorical age column (in quotes)
  • The split_by = argument (for x-axis) is set to the gender column

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender")

The pyramid can be displayed with percent of all cases on the x-axis, instead of counts, by including proportional = TRUE.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      proportional = TRUE)

When using the apyramid package, if the split_by column is binary (e.g. male/female, or yes/no), then the result will appear as a pyramid. However, if there are more than two values in the split_by column (not including NA), the pyramid will appear as a faceted bar plot with grey bars in the “background” indicating the range of the un-faceted data for that age group. In this case, values of split_by = will appear as labels at the top of each facet panel. For example, below is what occurs if split_by = is assigned the column hospital.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "hospital")  

Missing values

Rows with NA (missing) values in the split_by = or age_group = columns do not trigger the faceting shown above. By default these rows are not shown. However, you can display them, in an adjacent bar plot and as a separate age group at the top, by specifying na.rm = FALSE.

apyramid::age_pyramid(data = linelist,
                      age_group = "age_cat5",
                      split_by = "gender",
                      na.rm = FALSE)         # show patients missing age or gender

Proportions, colors, & aesthetics

By default, the bars display counts (not %), a dashed mid-line for each group is shown, and the colors are green/purple. Each of these parameters can be adjusted, as shown below:

You can also add additional ggplot() commands to the plot using the standard ggplot() “+” syntax, such as aesthetic themes and label adjustments:

apyramid::age_pyramid(
  data = linelist,
  age_group = "age_cat5",
  split_by = "gender",
  proportional = TRUE,              # show percents, not counts
  show_midpoint = FALSE,            # remove bar mid-point line
  #pal = c("orange", "purple")      # can specify alt. colors here (but not labels)
  )+                 
  
  # additional ggplot commands
  theme_minimal()+                               # simplify background
  scale_fill_manual(                             # specify colors AND labels
    values = c("orange", "purple"),              
    labels = c("m" = "Male", "f" = "Female"))+
  labs(y = "Percent of all cases",              # note x and y labs are switched
       x = "Age categories",                          
       fill = "Gender", 
       caption = "My data source and caption here",
       title = "Title of my plot",
       subtitle = "Subtitle with \n a second line...")+
  theme(
    legend.position = "bottom",                          # legend to bottom
    axis.text = element_text(size = 10, face = "bold"),  # fonts/sizes
    axis.title = element_text(size = 12, face = "bold"))

Aggregated data

The examples above assume your data are in a linelist format, with one row per observation. If your data are already aggregated into counts by age category, you can still use the apyramid package, as shown below.

For demonstration, we aggregate the linelist data into counts by age category and gender, in a “wide” format. This simulates a dataset that arrived as counts to begin with. Learn more about Grouping data and Pivoting data in their respective pages.

demo_agg <- linelist %>% 
  count(age_cat5, gender, name = "cases") %>% 
  pivot_wider(
    id_cols = age_cat5,
    names_from = gender,
    values_from = cases) %>% 
  rename(`missing_gender` = `NA`)

…which makes the dataset look like this: with columns for age category, and male counts, female counts, and missing counts.

To set up these data for the age pyramid, we will pivot the data to be “long” with the pivot_longer() function from tidyr. This is because ggplot() generally prefers “long” data, and apyramid is using ggplot().

# pivot the aggregated data into long format
demo_agg_long <- demo_agg %>% 
  pivot_longer(
    cols = c(f, m, missing_gender),           # cols to elongate
    names_to = "gender",                # name for new col of categories
    values_to = "counts") %>%           # name for new col of counts
  mutate(
    gender = na_if(gender, "missing_gender")) # convert "missing_gender" to NA
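
The same wide-to-long step can be tried on a few made-up rows. This is a minimal sketch; toy_wide and its values are hypothetical and stand in for demo_agg:

```r
library(dplyr)
library(tidyr)

# made-up wide counts, one row per age category (hypothetical values)
toy_wide <- tibble(
  age_cat5       = c("0-4", "5-9"),
  f              = c(10, 8),
  m              = c(12, 9),
  missing_gender = c(1, 0)
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols      = c(f, m, missing_gender),   # columns to stack
    names_to  = "gender",                  # old column names go here
    values_to = "counts") %>%              # old cell values go here
  mutate(gender = na_if(gender, "missing_gender"))  # label becomes a true NA

toy_long
```

Each age category now occupies three rows (f, m, and NA), which is the shape that the count = argument of age_pyramid() expects.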

Then use the split_by = and count = arguments of age_pyramid() to specify the respective columns in the data:

apyramid::age_pyramid(data = demo_agg_long,
                      age_group = "age_cat5",# column name for age category
                      split_by = "gender",   # column name for gender
                      count = "counts")      # column name for case counts

Note in the above, that the factor order of “m” and “f” is different (pyramid reversed). To adjust the order you must re-define gender in the aggregated data as a Factor and order the levels as desired. See the Factors page.
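
A minimal sketch of that re-leveling, using factor() from base R on a small made-up data frame (toy_agg is hypothetical and stands in for demo_agg_long):

```r
# made-up aggregated data (hypothetical values)
toy_agg <- data.frame(
  age_cat5 = c("0-4", "0-4", "5-9", "5-9"),
  gender   = c("f", "m", "f", "m"),
  counts   = c(10, 12, 8, 9)
)

# re-define gender as a factor, listing the levels in the desired order
toy_agg$gender <- factor(toy_agg$gender, levels = c("m", "f"))

levels(toy_agg$gender)
## [1] "m" "f"
```

Swapping the order in levels = swaps the order in which the two genders are treated by the plot.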

33.3 ggplot()

Using ggplot() to build your age pyramid allows for more flexibility, but requires more effort and understanding of how ggplot() works. It is also easier to accidentally make mistakes.

To use ggplot() to make demographic pyramids, you create two bar plots (one for each gender), convert the values in one plot to negative, and finally flip the x and y axes to display the bar plots vertically, their bases meeting in the plot middle.

Preparation

This approach uses the numeric age column, not the categorical column of age_cat5. So we will check to ensure the class of this column is indeed numeric.

class(linelist$age)
## [1] "numeric"

You could use the same logic below to build a pyramid from categorical data using geom_col() instead of geom_histogram().

Constructing the plot

First, understand that to make such a pyramid using ggplot() the approach is as follows:

  • Within the ggplot(), create two histograms using the numeric age column. Create one for each of the two grouping values (in this case genders male and female). To do this, the data for each histogram are specified within their respective geom_histogram() commands, with the respective filters applied to linelist.

  • One graph will have positive count values, while the other will have its counts converted to negative values - this creates the “pyramid” with the 0 value in the middle of the plot. The negative values are created using a special ggplot2 term ..count.. and multiplying by -1. (In ggplot2 version 3.4.0 and later, after_stat(count) is the preferred way to write ..count..)

  • The command coord_flip() switches the X and Y axes, resulting in the graphs turning vertical and creating the pyramid.

  • Lastly, the counts-axis value labels must be altered so they appear as “positive” counts on both sides of the pyramid (despite the underlying values on one side being negative).

A simple version of this, using geom_histogram(), is below:

  # begin ggplot
  ggplot(mapping = aes(x = age, fill = gender)) +
  
  # female histogram
  geom_histogram(data = linelist %>% filter(gender == "f"),
                 breaks = seq(0,85,5),
                 colour = "white") +
  
  # male histogram (values converted to negative)
  geom_histogram(data = linelist %>% filter(gender == "m"),
                 breaks = seq(0,85,5),
                 mapping = aes(y = ..count..*(-1)),
                 colour = "white") +
  
  # flip the X and Y axes
  coord_flip() +
  
  # adjust counts-axis scale
  scale_y_continuous(limits = c(-600, 900),
                     breaks = seq(-600,900,100),
                     labels = abs(seq(-600, 900, 100)))

DANGER: If the limits of your counts axis are set too low, and a counts bar exceeds them, the bar will disappear entirely or be artificially shortened! Watch for this if analyzing data which is routinely updated. Prevent it by having your count-axis limits auto-adjust to your data, as below.
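
One way to derive the counts-axis limits from the data is sketched below in base R. It assumes the same 5-year breaks as the histograms above; the toy data frame, the step size, and all object names are made up for illustration, and the binning with cut() only approximates what geom_histogram() does internally.

```r
# made-up ages and genders standing in for linelist (hypothetical values)
toy <- data.frame(
  age    = c(2, 3, 7, 12, 13, 14, 14, 33, 36, 38, 61),
  gender = c("f", "m", "f", "f", "m", "m", "m", "f", "m", "m", "f")
)

age_breaks <- seq(0, 85, 5)

# cases per age bin and gender, using the same breaks as the histograms
bin_counts <- table(
  cut(toy$age, breaks = age_breaks, include.lowest = TRUE),
  toy$gender
)

# tallest bar on either side, rounded up to the next multiple of `step`
step     <- 2
axis_max <- ceiling(max(bin_counts) / step) * step

# symmetric limits, breaks, and all-positive labels for the counts axis
count_limits <- c(-axis_max, axis_max)
count_breaks <- seq(-axis_max, axis_max, by = step)
count_labels <- abs(count_breaks)
```

These three objects could then be passed to scale_y_continuous(limits = count_limits, breaks = count_breaks, labels = count_labels), so the axis grows whenever the data do.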

There are many things you can change/add to this simple version, including:

  • Auto adjust counts-axis scale to your data (avoid errors discussed in warning above)
  • Manually specify colors and legend labels

Convert counts to percents

To convert counts to percents (of total), do this in your data prior to plotting. Below, we get the age-gender counts, then ungroup(), and then mutate() to create new percent columns. If you want percents within each gender instead, add group_by(gender) before the mutate() step so that the denominator is that gender's total.

# create dataset with proportion of total
pyramid_data <- linelist %>%
  count(age_cat5,
        gender,
        name = "counts") %>% 
  ungroup() %>%                 # ungroup so percents are not by group
  mutate(percent = round(100*(counts / sum(counts, na.rm=T)), digits = 1), 
         percent = case_when(
            gender == "f" ~ percent,
            gender == "m" ~ -percent,     # convert male to negative
            TRUE          ~ NA_real_))    # NA val must be numeric as well
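
The difference between percent of total and percent within gender can be seen on a few made-up counts. This is only a sketch; toy_counts and its values are hypothetical:

```r
library(dplyr)

# made-up age-gender counts (hypothetical values)
toy_counts <- tibble(
  age_cat5 = c("0-4", "0-4", "5-9", "5-9"),
  gender   = c("f", "m", "f", "m"),
  counts   = c(30, 10, 20, 40)
)

toy_pct <- toy_counts %>%
  mutate(pct_of_total = 100 * counts / sum(counts)) %>%   # denominator: all cases
  group_by(gender) %>%
  mutate(pct_of_gender = 100 * counts / sum(counts)) %>%  # denominator: cases of that gender
  ungroup()

toy_pct
```

Here the females aged 0-4 are 30% of all cases but 60% of female cases, so the choice of denominator changes the shape of the pyramid.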

Importantly, we save the max and min values so we know what the limits of the scale should be. These will be used in the ggplot() command below.

max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)

max_per
## [1] 10.9
min_per
## [1] -7.1

Finally we make the ggplot() on the percent data. We specify scale_y_continuous() to extend the pre-defined lengths in each direction (positive and “negative”). We use floor() and ceiling() to round decimals the appropriate direction (down or up) for the side of the axis.

# begin ggplot
  ggplot()+  # data and aesthetics are supplied in geom_col() below

  # case data graph
  geom_col(data = pyramid_data,
           mapping = aes(
             x = age_cat5,
             y = percent,
             fill = gender),         
           colour = "white")+       # white around each bar
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  

  # adjust the axes scales
  # scale_x_continuous(breaks = seq(0,100,5), labels = seq(0,100,5)) +
  scale_y_continuous(
    limits = c(min_per, max_per),
    breaks = seq(from = floor(min_per),                # sequence of values, by 2s
                 to = ceiling(max_per),
                 by = 2),
    labels = paste0(abs(seq(from = floor(min_per),     # sequence of absolute values, by 2s, with "%"
                            to = ceiling(max_per),
                            by = 2)),
                    "%"))+  

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",
               "m" = "darkgreen"),
    labels = c("Female", "Male")) +
  
  # label values (remember X and Y flipped now)
  labs(
    title = "Age and gender of cases",
    x = "Age group",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Data are from linelist \nn = {nrow(linelist)} (age or sex missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases) \nData as of: {format(Sys.Date(), '%d %b %Y')}")) +
  
  # display themes
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0.5), 
    plot.caption = element_text(hjust=0, size=11, face = "italic")
    )
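
The break and label arithmetic passed to scale_y_continuous() above can be checked on its own. A short worked example, using the max_per and min_per values printed earlier (axis_breaks and axis_labels are names introduced here for illustration):

```r
min_per <- -7.1
max_per <- 10.9

# floor() rounds the negative side down, ceiling() rounds the positive side up
axis_breaks <- seq(from = floor(min_per), to = ceiling(max_per), by = 2)

# absolute values with a percent sign, so both sides read as positive
axis_labels <- paste0(abs(axis_breaks), "%")

axis_breaks
## [1] -8 -6 -4 -2  0  2  4  6  8 10
axis_labels
## [1] "8%"  "6%"  "4%"  "2%"  "0%"  "2%"  "4%"  "6%"  "8%"  "10%"
```

Note that with limits = c(min_per, max_per), a break that falls outside the limits (here -8) is not drawn. Using floor(min_per) and ceiling(max_per) as the limits as well is one way to keep it.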

Compare to baseline

With the flexibility of ggplot(), you can have a second layer of bars in the background that represent the “true” or “baseline” population pyramid. This can provide a nice visualization to compare the observed with the baseline.

Import and view the population data (see Download handbook and data page):

# import the population demographics data
pop <- rio::import("country_demographics.csv")

First some data management steps:

Here we record the order of age categories that we want to appear. Due to some quirks in the way ggplot() is implemented, in this specific scenario it is easiest to store these as a character vector and use them later in the plotting function.

# record correct age cat levels
age_levels <- c("0-4","5-9", "10-14", "15-19", "20-24",
                "25-29","30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74",
                "75-79", "80-84", "85+")

Combine the population and case data through the dplyr function bind_rows():

  • First, ensure they have the exact same column names, age categories values, and gender values
  • Make them have the same data structure: columns of age category, gender, counts, and percent of total
  • Bind them together, one on top of the other (bind_rows())

# create/transform population data, with percent of total
########################################################
pop_data <- pop %>% 
  pivot_longer(      # pivot gender columns longer
    cols = c(m, f),
    names_to = "gender",
    values_to = "counts") %>% 
  
  mutate(
    percent  = round(100*(counts / sum(counts, na.rm=T)),1),  # % of total
    percent  = case_when(                                                        
     gender == "f" ~ percent,
     gender == "m" ~ -percent,               # if male, convert % to negative
     TRUE          ~ NA_real_))

Review the changed population dataset

Now implement the same for the case linelist. Slightly different because it begins with case-rows, not counts.

# create case data by age/gender, with percent of total
#######################################################
case_data <- linelist %>%
  count(age_cat5, gender, name = "counts") %>%  # counts by age-gender groups
  ungroup() %>% 
  mutate(
    percent = round(100*(counts / sum(counts, na.rm=T)),1),  # calculate % of total for age-gender groups
    percent = case_when(                                     # convert % to negative if male
      gender == "f" ~ percent,
      gender == "m" ~ -percent,
      TRUE          ~ NA_real_))

Review the changed case dataset

Now the two data frames are combined, one on top of the other (they have the same column names). We can “name” each of the data frames, and use the .id = argument to create a new column “data_source” that will indicate which data frame each row originated from. We can use this column to filter in the ggplot().

# combine case and population data (same column names, age_cat values, and gender values)
pyramid_data <- bind_rows("cases" = case_data, "population" = pop_data, .id = "data_source")
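
How the names and the .id = argument work together can be seen on two tiny made-up data frames. This is a sketch; toy_cases and toy_pop are hypothetical:

```r
library(dplyr)

# two made-up data frames with identical column names (hypothetical values)
toy_cases <- tibble(age_cat5 = c("0-4", "5-9"), gender = c("f", "m"), percent = c(12.5, -7.5))
toy_pop   <- tibble(age_cat5 = c("0-4", "5-9"), gender = c("f", "m"), percent = c(6.1, -6.4))

# the names given to each data frame become the values of the new column
toy_combined <- bind_rows("cases" = toy_cases, "population" = toy_pop, .id = "data_source")

# the new column can then be used to pull out one source at a time
toy_combined %>% filter(data_source == "population")
```

The data_source column is placed first, and each row carries the name of the data frame it came from.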

Store the maximum and minimum percent values, used in the plotting function to define the extent of the plot (and not cut short any bars!)

# Define extent of percent axis, used for plot limits
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)

Now the plot is made with ggplot():

  • One bar graph of population data (wider, more transparent bars)
  • One bar graph of case data (narrower, more solid bars)

# begin ggplot
##############
ggplot()+  # data and aesthetics are supplied in each geom_col() below

  # population data graph
  geom_col(
    data = pyramid_data %>% filter(data_source == "population"),
    mapping = aes(
      x = age_cat5,
      y = percent,
      fill = gender),
    colour = "black",                               # black color around bars
    alpha = 0.2,                                    # more transparent
    width = 1)+                                     # full width
  
  # case data graph
  geom_col(
    data = pyramid_data %>% filter(data_source == "cases"), 
    mapping = aes(
      x = age_cat5,                               # age categories as original X axis
      y = percent,                                # % as original Y-axis
      fill = gender),                             # fill of bars by gender
    colour = "black",                               # black color around bars
    alpha = 1,                                      # not transparent 
    width = 0.3)+                                   # narrower than the population bars
  
  # flip the X and Y axes to make pyramid vertical
  coord_flip()+
  
  # manually ensure that age-axis is ordered correctly
  scale_x_discrete(limits = age_levels)+     # defined in chunk above
  
  # set percent-axis 
  scale_y_continuous(
    limits = c(min_per, max_per),                                          # min and max defined above
    breaks = seq(floor(min_per), ceiling(max_per), by = 2),                # from min% to max% by 2 
    labels = paste0(                                                       # for the labels, paste together... 
              abs(seq(floor(min_per), ceiling(max_per), by = 2)), "%"))+                                                  

  # designate colors and legend labels manually
  scale_fill_manual(
    values = c("f" = "orange",         # assign colors to values in the data
               "m" = "darkgreen"),
    labels = c("f" = "Female",
               "m" = "Male")      # change labels that appear in legend, note order
  ) +

  # plot labels, titles, caption    
  labs(
    title = "Case age and gender distribution,\nas compared to baseline population",
    subtitle = "",
    x = "Age category",
    y = "Percent of total",
    fill = NULL,
    caption = stringr::str_glue("Cases shown on top of country demographic baseline\nCase data are from linelist, n = {nrow(linelist)}\nAge or gender missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases\nCase data as of: {format(max(linelist$date_onset, na.rm=T), '%d %b %Y')}")) +
  
  # optional aesthetic themes
  theme(
    legend.position = "bottom",                             # move legend to bottom
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank(),
    axis.line = element_line(colour = "black"),
    plot.title = element_text(hjust = 0), 
    plot.caption = element_text(hjust=0, size=11, face = "italic"))

33.4 Likert scale

The techniques used to make a population pyramid with ggplot() can also be used to make plots of Likert-scale survey data.

Import the data (see Download handbook and data page if desired).

# import the likert survey response data
likert_data <- rio::import("likert_data.csv")

Start with data that looks like this, with a categorical classification of each respondent (status) and their answers to 8 questions on a 4-point Likert-type scale (“Very Poor”, “Poor”, “Good”, “Very Good”).

First, some data management steps:

  • Pivot the data longer
  • Create new column direction depending on whether response was generally “positive” or “negative”
  • Set the Factor level order for the status column and the Response column
  • Store the max count value so limits of plot are appropriate

melted <- likert_data %>% 
  pivot_longer(
    cols = Q1:Q8,
    names_to = "Question",
    values_to = "Response") %>% 
  mutate(
    
    direction = case_when(
      Response %in% c("Poor","Very Poor")  ~ "Negative",
      Response %in% c("Good", "Very Good") ~ "Positive",
      TRUE                                 ~ "Unknown"),
    
    status = fct_relevel(status, "Junior", "Intermediate", "Senior"),
    
    # must reverse 'Very Poor' and 'Poor' for ordering to work
    Response = fct_relevel(Response, "Very Good", "Good", "Very Poor", "Poor")) 

# get largest value for scale limits
melted_max <- melted %>% 
  count(status, Question) %>% # get counts
  pull(n) %>%                 # column 'n'
  max(na.rm=T)                # get max

Now make the plot. As in the age pyramids above, we are creating two bar plots and inverting the values of one of them to negative.

We use geom_bar() because our data are one row per observation, not aggregated counts. We use the special ggplot2 term ..count.. (written as after_stat(count) in ggplot2 3.4.0 and later) in one of the bar plots to invert the values negative (*-1), and we set position = "stack" so the values stack on top of each other.

# make plot
ggplot()+
     
  # bar graph of the "negative" responses 
     geom_bar(
       data = melted %>% filter(direction == "Negative"),
       mapping = aes(
         x = status,
         y = ..count..*(-1),    # counts inverted to negative
         fill = Response),
       color = "black",
       position = "stack")+
     
     # bar graph of the "positive" responses
     geom_bar(
       data = melted %>% filter(direction == "Positive"),
       mapping = aes(
         x = status,
         fill = Response),
       colour = "black",
       position = "stack")+
     
     # flip the X and Y axes
     coord_flip()+
  
     # Black vertical line at 0
     geom_hline(yintercept = 0, color = "black", size=1)+
     
    # convert labels to all positive numbers
    scale_y_continuous(
      
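      # limits of the counts axis (scale_y, drawn horizontally after coord_flip())
      limits = c(-ceiling(melted_max/10)*11,    # negative limit extends 10% beyond the lowest break
                 ceiling(melted_max/10)*10),    # max count rounded up to the nearest 10
      
      # break values of the counts axis, from negative to positive by 10
      breaks = seq(from = -ceiling(melted_max/10)*10,
                   to = ceiling(melted_max/10)*10,
                   by = 10),
      
      # labels of the counts axis (absolute values, so both sides read as positive)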
      labels = abs(unique(c(seq(-ceiling(melted_max/10)*10, 0, 10),
                            seq(0, ceiling(melted_max/10)*10, 10))))) +
     
    # color scales manually assigned 
    scale_fill_manual(
      values = c("Very Good"  = "green4", # assigns colors
                "Good"      = "green3",
                "Poor"      = "yellow",
                "Very Poor" = "red3"),
      breaks = c("Very Good", "Good", "Poor", "Very Poor"))+ # orders the legend
     
    
     
    # facet the entire plot so each question is a sub-plot
    facet_wrap( ~ Question, ncol = 3)+
     
    # labels, titles, caption
    labs(
      title = str_glue("Likert-style responses\nn = {nrow(likert_data)}"),
      x = "Respondent status",
      y = "Number of responses",
      fill = "")+

     # display adjustments 
     theme_minimal()+
     theme(axis.text = element_text(size = 12),
           axis.title = element_text(size = 14, face = "bold"),
           strip.text = element_text(size = 14, face = "bold"),  # facet sub-titles
           plot.title = element_text(size = 20, face = "bold"),
           panel.background = element_rect(fill = NA, color = "black")) # black box around each facet
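
An alternative to inverting ..count.. inside the plot is to compute signed counts in the data first and plot them with geom_col(). A minimal sketch on made-up responses follows; toy_responses and n_signed are hypothetical names, and this is a variation on the approach above rather than the code used for the figure:

```r
library(dplyr)

# made-up survey responses, one row per answer (hypothetical values)
toy_responses <- tibble(
  status   = c("Junior", "Junior", "Junior", "Senior", "Senior"),
  Response = c("Poor", "Good", "Good", "Very Poor", "Very Good")
)

toy_signed <- toy_responses %>%
  mutate(direction = case_when(
    Response %in% c("Poor", "Very Poor") ~ "Negative",
    Response %in% c("Good", "Very Good") ~ "Positive",
    TRUE                                 ~ "Unknown")) %>%
  count(status, direction, Response, name = "n") %>%          # one row per bar segment
  mutate(n_signed = if_else(direction == "Negative", -n, n))  # negative side below zero

toy_signed
```

The n_signed column could then be mapped to y in a single geom_col(aes(x = status, y = n_signed, fill = Response)) layer.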

33.5 Resources

apyramid documentation

34 Heat plots

Heat plots, also known as “heat maps” or “heat tiles”, can be useful visualizations when trying to display 3 variables (x-axis, y-axis, and fill). Below we demonstrate two examples:

  • A visual matrix of transmission events by age (“who infected whom”)
  • Tracking reporting metrics across many facilities/jurisdictions over time

34.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  tidyverse,       # data manipulation and visualization
  rio,             # importing data 
  lubridate        # working with dates
  )

Datasets

This page utilizes the case linelist of a simulated outbreak for the transmission matrix section, and a separate dataset of daily malaria case counts by facility for the metrics tracking section. They are loaded and cleaned in their individual sections.

34.2 Transmission matrix

Heat tiles can be useful to visualize matrices. One example is to display “who-infected-whom” in an outbreak. This assumes that you have information on transmission events.

Note that the Contact tracing page contains another example of making a heat tile contact matrix, using a different (perhaps simpler) dataset where the ages of cases and their sources are neatly aligned in the same row of the data frame. This same data is used to make a density map in the ggplot tips page. The example below begins from a case linelist and so involves considerable data manipulation prior to achieving a plottable data frame. So there are many scenarios to choose from…

We begin from the case linelist of a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import() function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).

The first 50 rows of the linelist are shown below for demonstration:

linelist <- import("linelist_cleaned.rds")

In this linelist:

  • There is one row per case, as identified by case_id
  • There is a later column infector that contains the case_id of the infector, who is also a case in the linelist

Data preparation

Objective: We need to achieve a “long”-style data frame that contains one row per possible age-to-age transmission route, with a numeric column containing that row’s proportion of all observed transmission events in the linelist.

This will take several data manipulation steps to achieve:

Make cases data frame

To begin, we create a data frame of the cases, their ages, and their infectors - we call the data frame case_ages. The first 50 rows are displayed below.

case_ages <- linelist %>% 
  select(case_id, infector, age_cat) %>% 
  rename("case_age_cat" = "age_cat")

Make infectors data frame

Next, we create a data frame of the infectors - at the moment it consists of a single column. These are the infector IDs from the linelist. Not every case has a known infector, so we remove missing values. The first 50 rows are displayed below.

infectors <- linelist %>% 
  select(infector) %>% 
  drop_na(infector)

Next, we use joins to procure the ages of the infectors. This is not simple, because the linelist does not list the infectors’ ages as such. We achieve this result by joining the case linelist to the infectors. We begin with the infectors, and left_join() (add) the case linelist such that the infector column in the left-side “baseline” data frame joins to the case_id column in the right-side linelist data frame.

Thus, the data from the infector’s case record in the linelist (including age) is added to the infector row. The 50 first rows are displayed below.

infector_ages <- infectors %>%             # begin with infectors
  left_join(                               # add the linelist data to each infector  
    linelist,
    by = c("infector" = "case_id")) %>%    # match infector to their information as a case
  select(infector, age_cat) %>%            # keep only columns of interest
  rename("infector_age_cat" = "age_cat")   # rename for clarity

Then, we combine the cases and their ages with the infectors and their ages. Each of these data frames has the column infector, so it is used for the join. The first rows are displayed below:

ages_complete <- case_ages %>%  
  left_join(
    infector_ages,
    by = "infector") %>%        # each has the column infector
  drop_na()                     # drop rows with any missing data
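
One thing to watch with this kind of join: if an infector appears in several rows of the infector data frame (because they infected more than one person), joining by infector pairs each case with every copy of that infector's row, so a transmission event can be counted more than once. The sketch below uses a made-up three-case linelist and adds distinct() so that each infector contributes a single row. All toy_ names are hypothetical, and this is an illustrative variation, not the code that produced the table below.

```r
library(dplyr)

# made-up linelist: case A infected B and C; A's own infector is unknown
toy_linelist <- tibble(
  case_id  = c("A", "B", "C"),
  infector = c(NA, "A", "A"),
  age_cat  = c("30-49", "0-4", "5-9")
)

toy_case_ages <- toy_linelist %>%
  select(case_id, infector, case_age_cat = age_cat)

# one row per unique infector, with that infector's own age category
toy_infector_ages <- toy_linelist %>%
  filter(!is.na(infector)) %>%
  distinct(infector) %>%
  left_join(
    select(toy_linelist, case_id, infector_age_cat = age_cat),
    by = c("infector" = "case_id"))

# one row per case with a known infector
toy_pairs <- toy_case_ages %>%
  left_join(toy_infector_ages, by = "infector") %>%
  filter(!is.na(infector_age_cat))

nrow(toy_pairs)
## [1] 2
```

Without distinct(), the infector "A" would appear twice in toy_infector_ages and the final join would return four rows for two transmission events.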

Below, a simple cross-tabulation of counts between the case and infector age groups. Labels added for clarity.

table(cases = ages_complete$case_age_cat,
      infectors = ages_complete$infector_age_cat)
##        infectors
## cases   0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+
##   0-4   105 156   105   114   143   117    13   0
##   5-9   102 132   110   102   117    96    12   5
##   10-14 104 109    91    79   120    80    12   4
##   15-19  85 105    82    39    75    69     7   5
##   20-29 101 127   109    80   143   107    22   4
##   30-49  72  97    56    54    98    61     4   5
##   50-69   5   6    15     9     7     5     2   0
##   70+     1   0     2     0     0     0     0   0

We can convert this table to a data frame with data.frame() from base R, which also automatically converts it to “long” format, which is desired for the ggplot(). The first rows are shown below.

long_counts <- data.frame(table(
    cases     = ages_complete$case_age_cat,
    infectors = ages_complete$infector_age_cat))

Now we do the same, but apply prop.table() from base R to the table so instead of counts we get proportions of the total. The first 50 rows are shown below.

long_prop <- data.frame(prop.table(table(
    cases = ages_complete$case_age_cat,
    infectors = ages_complete$infector_age_cat)))
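
The table-to-long-proportions step can be checked on a tiny made-up example in base R (the toy_ vectors are hypothetical):

```r
# made-up age categories for four transmission pairs (hypothetical values)
toy_case_age     <- c("0-4", "0-4", "5-9", "5-9")
toy_infector_age <- c("0-4", "5-9", "5-9", "5-9")

# cross-tabulate, convert counts to proportions of the total, then to long format
toy_tab       <- table(cases = toy_case_age, infectors = toy_infector_age)
toy_long_prop <- data.frame(prop.table(toy_tab))

names(toy_long_prop)
## [1] "cases"     "infectors" "Freq"
sum(toy_long_prop$Freq)
## [1] 1
```

The Freq column holds each cell's share of all pairs, including cells with a proportion of 0. prop.table() also accepts a margin = argument if proportions within rows or columns are wanted instead.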

Create heat plot

Now finally we can create the heat plot with ggplot2 package, using the geom_tile() function. See the ggplot tips page to learn more extensively about color/fill scales, especially the scale_fill_gradient() function.

  • In the aesthetics aes() of geom_tile() set the x and y as the case age and infector age
  • Also in aes() set the argument fill = to the Freq column - this is the value that will be converted to a tile color
  • Set a scale color with scale_fill_gradient() - you can specify the high/low colors
    • Note that scale_color_gradient() is different! In this case you want the fill
  • Because the color is made via “fill”, you can use the fill = argument in labs() to change the legend title

ggplot(data = long_prop)+       # use long data, with proportions as Freq
  geom_tile(                    # visualize it in tiles
    aes(
      x = cases,         # x-axis is case age
      y = infectors,     # y-axis is infector age
      fill = Freq))+            # color of the tile is the Freq column in the data
  scale_fill_gradient(          # adjust the fill color of the tiles
    low = "blue",
    high = "orange")+
  labs(                         # labels
    x = "Case age",
    y = "Infector age",
    title = "Who infected whom",
    subtitle = "Frequency matrix of transmission events",
    fill = "Proportion of all\ntransmission events"     # legend title
  )

34.3 Reporting metrics over time

Often in public health, one objective is to assess trends over time for many entities (facilities, jurisdictions, etc.). One way to visualize such trends over time is a heat plot where the x-axis is time and on the y-axis are the many entities.

Data preparation

We begin by importing a dataset of daily malaria reports from many facilities. The reports contain a date, province, district, and malaria counts. See the page on Download handbook and data for information on how to download these data. Below are the first 30 rows:

facility_count_data <- import("malaria_facility_count_data.rds")

Aggregate and summarize

The objective in this example is to transform the daily facility total malaria case counts (seen above) into weekly summary statistics of facility reporting performance - in this case the proportion of days per week that the facility reported any data. For this example we will show data only for Spring District.

To achieve this we will do the following data management steps:

  1. Filter the data as appropriate (by place, date)
  2. Create a week column using floor_date() from package lubridate
    • This function returns the start-date of a given date’s week, using a specified start date of each week (e.g. “Mondays”)
  3. The data are grouped by columns “location” and “week” to create analysis units of “facility-week”
  4. The function summarise() creates new columns reflecting summary statistics per facility-week group:
    • Number of days per week (7 - a static value)
    • Number of reports received from the facility-week (could be more than 7!)
    • Sum of malaria cases reported by the facility-week (just for interest)
    • Number of unique days in the facility-week for which there is data reported
    • Percent of the 7 days per facility-week for which data was reported
  5. The data frame is joined with right_join() to a comprehensive list of all possible facility-week combinations, to make the dataset complete. The matrix of all possible combinations is created by applying expand() to those two columns of the data frame as it is at that moment in the pipe chain (represented by .). Because a right_join() is used, all rows in the expand() data frame are kept, and added to agg_weeks if necessary. These new rows appear with NA (missing) summarized values.

Below we demonstrate step-by-step:

# Create weekly summary dataset
agg_weeks <- facility_count_data %>% 
  
  # filter the data as appropriate
  filter(
    District == "Spring",
    data_date < as.Date("2020-08-01")) 

Now the dataset has nrow(agg_weeks) rows, when it previously had nrow(facility_count_data).

Next we create a week column reflecting the start date of the week for each record. This is achieved with the lubridate package and the function floor_date(), which is set to “week” and for the weeks to begin on Mondays (day 1 of the week - Sundays would be 7). The top rows are shown below.

agg_weeks <- agg_weeks %>% 
  # Create week column from data_date
  mutate(
    week = lubridate::floor_date(                     # create new column of weeks
      data_date,                                      # date column
      unit = "week",                                  # give start of the week
      week_start = 1))                                # weeks to start on Mondays 
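
To see what value ends up in the week column, here is a base R sketch of the same calculation on a few made-up dates. It is intended to mirror floor_date(unit = "week", week_start = 1), not to replace it; toy_dates and monday_of_week are names introduced for illustration.

```r
# made-up report dates: a Monday, a Wednesday, a Sunday, and the next Monday
toy_dates <- as.Date(c("2020-07-13", "2020-07-15", "2020-07-19", "2020-07-20"))

# ISO weekday number: Monday = 1, ..., Sunday = 7
weekday_num <- as.integer(format(toy_dates, "%u"))

# step back to the Monday that starts each date's week
monday_of_week <- toy_dates - (weekday_num - 1)

monday_of_week
## [1] "2020-07-13" "2020-07-13" "2020-07-13" "2020-07-20"
```

Every date from Monday 13 July to Sunday 19 July maps to the same week-start date, which is what lets the later group_by() treat them as one facility-week.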

The new week column can be seen on the far right of the data frame

Now we group the data into facility-weeks and summarise them to produce statistics per facility-week. See the page on Descriptive tables for tips. The grouping itself doesn’t change the data frame, but it impacts how the subsequent summary statistics are calculated.

The top rows are shown below. Note how the columns have completely changed to reflect the desired summary statistics. Each row reflects one facility-week.

agg_weeks <- agg_weeks %>%   

  # Group into facility-weeks
  group_by(location_name, week) %>%
  
  # Create summary statistics columns on the grouped data
  summarize(
    n_days          = 7,                                          # 7 days per week           
    n_reports       = dplyr::n(),                                 # number of reports received per week (could be >7)
    malaria_tot     = sum(malaria_tot, na.rm = T),                # total malaria cases reported
    n_days_reported = length(unique(data_date)),                  # number of unique days reporting per week
    p_days_reported = round(100*(n_days_reported / n_days)))      # percent of days reporting

Finally, we run the command below to ensure that ALL possible facility-weeks are present in the data, even if they were missing before.

We use a right_join() to join agg_weeks to a version of itself that has been expanded to include all possible combinations of the columns week and location_name (within the pipe, the dataset is represented by “.”). See documentation on the expand() function in the page on Pivoting.

# Create data frame of every possible facility-week
expanded_weeks <- agg_weeks %>% 
  mutate(week = as.factor(week)) %>%         # convert date to a factor so expand() works correctly
  tidyr::expand(., week, location_name) %>%  # expand data frame to include all possible facility-week combinations
                                             # note: "." represents the dataset at that moment in the pipe chain
  mutate(week = as.Date(week))               # re-convert week to class Date so the subsequent right_join works

Here is expanded_weeks:

Before running this code, agg_weeks contains nrow(agg_weeks) rows.

# Use a right-join with the expanded facility-week list to fill-in the missing gaps in the data
agg_weeks <- agg_weeks %>%      
  right_join(expanded_weeks) %>%                            # Ensure every possible facility-week combination appears in the data
  mutate(p_days_reported = replace_na(p_days_reported, 0))  # convert missing values to 0                           
## Joining, by = c("location_name", "week")

After running this code, agg_weeks contains nrow(agg_weeks) rows.
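As an aside, the same gap-filling can often be done in one step with complete() from tidyr. The sketch below uses a small made-up data frame (toy_weeks, hypothetical) rather than agg_weeks. Note that complete() operates within groups, so a grouped data frame would likely need ungroup() first.

```r
library(dplyr)
library(tidyr)

# hypothetical facility-weeks: Facility B has no row for the week of 8 June
toy_weeks <- tibble(
  location_name   = c("Facility A", "Facility A", "Facility B"),
  week            = as.Date(c("2020-06-01", "2020-06-08", "2020-06-01")),
  p_days_reported = c(100, 57, 86))

# add a row for every missing facility-week combination,
# setting p_days_reported to 0 in the newly created rows
toy_complete <- toy_weeks %>% 
  complete(
    location_name, week,
    fill = list(p_days_reported = 0))

toy_complete
```

This is an alternative to the expand() and right_join() approach shown above, not a replacement for it; the two-step version makes the intermediate list of combinations easier to inspect.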

Create heat plot

The ggplot() is made using geom_tile() from the ggplot2 package:

  • The week column on the x-axis is of class Date, allowing use of scale_x_date()
  • location_name on the y-axis will show all facility names
  • The fill is p_days_reported, the performance for that facility-week (numeric)
  • scale_fill_gradient() is used on the numeric fill, specifying colors for high, low, and NA
  • scale_x_date() is used on the x-axis specifying labels every 2 weeks and their format
  • Display themes and labels can be adjusted as necessary

Basic

A basic heat plot is produced below, using the default colors, scales, etc. As explained above, within the aes() for geom_tile() you must provide an x-axis column, a y-axis column, and a column for the fill =. The fill is the numeric value that presents as tile color.

ggplot(data = agg_weeks)+
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported))

Cleaned plot

We can make this plot look better by adding additional ggplot2 functions, as shown below. See the page on ggplot tips for details.

ggplot(data = agg_weeks)+ 
  
  # show data as tiles
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported),      
    color = "white")+                 # white gridlines
  
  scale_fill_gradient(
    low = "orange",
    high = "darkgreen",
    na.value = "grey80")+
  
  # date axis
  scale_x_date(
    expand = c(0,0),             # remove extra space on sides
    date_breaks = "2 weeks",     # labels every 2 weeks
    date_labels = "%d\n%b")+     # format is day over month (\n is newline)
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),           # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),         # width of legend key
    
    axis.text.x = element_text(size=12),              # axis text size
    axis.text.y = element_text(vjust=0.2),            # axis text alignment
    axis.ticks = element_line(size=0.4),               
    axis.title = element_text(size=12, face="bold"),  # axis title size and bold
    
    plot.title = element_text(hjust=0,size=14,face="bold"),  # title left-aligned, large, bold
    plot.caption = element_text(hjust = 0, face = "italic")  # caption left-aligned and italic
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)",           # legend title, because legend shows fill
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, May-July 2020",
       caption = "7-day weeks beginning on Mondays.")

Ordered y-axis

Currently, the facilities are ordered alphanumerically from the bottom to the top. If you want to adjust the order of the y-axis facilities, convert them to class factor and provide the order. See the page on Factors for tips.

Since there are many facilities and we don’t want to write them all out, we will try another approach - ordering the facilities in a data frame and using the resulting column of names as the factor level order. Below, the column location_name is converted to a factor, and the order of its levels is set based on the total number of reporting days filed by the facility across the whole time-span.

To do this, we create a data frame holding the total number of reporting days per facility, arranged in ascending order. We can use the location_name column of this data frame to order the factor levels in the plot.

facility_order <- agg_weeks %>% 
  group_by(location_name) %>% 
  summarize(tot_reports = sum(n_days_reported, na.rm=T)) %>% 
  arrange(tot_reports) # ascending order

See the data frame below:

Now use a column from the above data frame (facility_order$location_name) to be the order of the factor levels of location_name in the data frame agg_weeks:

# load package 
pacman::p_load(forcats)

# create factor and define levels manually
agg_weeks <- agg_weeks %>% 
  mutate(location_name = fct_relevel(
    location_name, facility_order$location_name)
    )
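To make the ordering logic concrete, here is a minimal base R sketch of the same idea on a made-up data frame (toy_facilities and its values are hypothetical). It totals the reporting days per facility, sorts the totals in ascending order, and uses the sorted names as the factor levels.

```r
# hypothetical data: two weeks of reporting for three facilities
toy_facilities <- data.frame(
  location_name   = c("A", "A", "B", "B", "C", "C"),
  n_days_reported = c(7, 6, 2, 1, 5, 4))

# total reporting days per facility
totals <- aggregate(n_days_reported ~ location_name, data = toy_facilities, FUN = sum)

# arrange in ascending order of the totals
totals <- totals[order(totals$n_days_reported), ]

# convert to factor, with levels in the order of the sorted totals
toy_facilities$location_name <- factor(
  toy_facilities$location_name,
  levels = totals$location_name)

levels(toy_facilities$location_name)
```

The facility with the fewest reporting days becomes the first level, so it is drawn at the bottom of the y-axis.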

And now the data are re-plotted, with location_name being an ordered factor:

ggplot(data = agg_weeks)+ 
  
  # show data as tiles
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported),      
    color = "white")+                 # white gridlines
  
  scale_fill_gradient(
    low = "orange",
    high = "darkgreen",
    na.value = "grey80")+
  
  # date axis
  scale_x_date(
    expand = c(0,0),             # remove extra space on sides
    date_breaks = "2 weeks",     # labels every 2 weeks
    date_labels = "%d\n%b")+     # format is day over month (\n is newline)
  
  # aesthetic themes
  theme_minimal()+                                  # simplify background
  
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),           # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),         # width of legend key
    
    axis.text.x = element_text(size=12),              # axis text size
    axis.text.y = element_text(vjust=0.2),            # axis text alignment
    axis.ticks = element_line(size=0.4),               
    axis.title = element_text(size=12, face="bold"),  # axis title size and bold
    
    plot.title = element_text(hjust=0,size=14,face="bold"),  # title left-aligned, large, bold
    plot.caption = element_text(hjust = 0, face = "italic")  # caption left-aligned and italic
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)",           # legend title, because legend shows fill
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, May-July 2020",
       caption = "7-day weeks beginning on Mondays.")

Display values

You can add a geom_text() layer on top of the tiles, to display the actual numbers of each tile. Be aware this may not look pretty if you have many small tiles!

The following code has been added: geom_text(aes(label = p_days_reported)). This adds text onto every tile. The text displayed is the value assigned to the argument label =, which in this case has been set to the same numeric column p_days_reported that is also used to create the color gradient.

ggplot(data = agg_weeks)+ 
  
  # show data as tiles
  geom_tile(
    aes(x = week,
        y = location_name,
        fill = p_days_reported),      
    color = "white")+                 # white gridlines
  
  # text
  geom_text(
    aes(
      x = week,
      y = location_name,
      label = p_days_reported))+      # add text on top of tile
  
  # fill scale
  scale_fill_gradient(
    low = "orange",
    high = "darkgreen",
    na.value = "grey80")+
  
  # date axis
  scale_x_date(
    expand = c(0,0),             # remove extra space on sides
    date_breaks = "2 weeks",     # labels every 2 weeks
    date_labels = "%d\n%b")+     # format is day over month (\n is newline)
  
  # aesthetic themes
  theme_minimal()+                                    # simplify background
  
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),           # height of legend key
    legend.key.width  = grid::unit(0.6,"cm"),         # width of legend key
    
    axis.text.x = element_text(size=12),              # axis text size
    axis.text.y = element_text(vjust=0.2),            # axis text alignment
    axis.ticks = element_line(size=0.4),               
    axis.title = element_text(size=12, face="bold"),  # axis title size and bold
    
    plot.title = element_text(hjust=0,size=14,face="bold"),  # title left-aligned, large, bold
    plot.caption = element_text(hjust = 0, face = "italic")  # caption left-aligned and italic
    )+
  
  # plot labels
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)",           # legend title, because legend shows fill
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, May-July 2020",
       caption = "7-day weeks beginning on Mondays.")

35 Diagrams and charts

This page covers code to produce:

  • Flow diagrams using DiagrammeR and the DOT language
  • Alluvial/Sankey diagrams
  • Event timelines

35.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,            # file import (provides import(), used below)
  DiagrammeR,     # for flow diagrams
  networkD3,      # for alluvial/Sankey diagrams
  tidyverse)      # data management and visualization

Import data

Most of the content in this page does not require a dataset. However, in the Sankey diagram section, we will use the case linelist from a simulated Ebola epidemic. If you want to follow along for this part, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

35.2 Flow diagrams

One can use the R package DiagrammeR to create charts/flow charts. They can be static, or they can adjust somewhat dynamically based on changes in a dataset.

Tools

The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.

Basic structure

  1. Open the instructions grViz("
  2. Specify directionality and name of the graph, and open brackets, e.g. digraph my_flow_chart {
  3. Graph statement (layout, rank direction)
  4. Nodes statements (create nodes)
  5. Edges statements (gives links between nodes)
  6. Close the instructions }")

Simple examples

Below are two simple examples

A very minimal example:

# A minimal plot
DiagrammeR::grViz("digraph {
  
graph[layout = dot, rankdir = LR]

a
b
c

a -> b -> c
}")

An example with perhaps a bit more applied public health context:

grViz("                           # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,
         overlap = true,
         fontsize = 10]
  
  # nodes
  #######
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]               # width of circles
  
  Primary                         # names of nodes
  Secondary
  Tertiary

  # edges
  #######
  Primary   -> Secondary [label = ' case transfer']
  Secondary -> Tertiary [label = ' case transfer']
}
")

Syntax

Basic syntax

Node names, or edge statements, can be separated with spaces, semicolons, or newlines.

Rank direction

A plot can be re-oriented to move left-to-right by adjusting the rankdir argument within the graph statement. The default is TB (top-to-bottom), but it can be LR (left-to-right), RL, or BT.

Node names

Node names can be single words, as in the simple example above. To use multi-word names or special characters (e.g. parentheses, dashes), put the node name within single quotes (’ ’). It may be easier to have a short node name, and assign a label, as shown below within brackets [ ]. If you want to have a newline within the node’s name, you must do it via a label - use \n in the node label within single quotes, as shown below.

Subgroups

Within edge statements, subgroups can be created on either side of the edge with curly brackets ({ }). The edge then applies to all nodes in the bracket - it is a shorthand.
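The syntax rules above can be sketched in one small diagram. The node names here (Clinic, Hospital, Lab, and the quoted rural health centre) are made up for illustration; the example assumes the DiagrammeR package is installed.

```r
library(DiagrammeR)

syntax_demo <- grViz("
digraph syntax_demo {

  graph [layout = dot, rankdir = LR]

  # multi-word node name with special characters, in single quotes
  'Health centre (rural)'

  # short node name with a two-line label
  Lab [label = 'Reference\nLaboratory']

  # node names separated by semicolons
  Clinic; Hospital

  # subgroup: one edge statement links all three nodes to Lab
  {Clinic Hospital 'Health centre (rural)'} -> Lab
}
")

syntax_demo
```

Changing rankdir to TB in the graph statement should re-orient the same diagram from top to bottom.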

Layouts

  • dot (set rankdir to either TB, LR, RL, or BT)
  • neato
  • twopi
  • circo

Nodes - editable attributes

  • label (text, in single quotes if multi-word)
  • fillcolor (many possible colors)
  • fontcolor
  • alpha (transparency 0-1)
  • shape (ellipse, oval, diamond, egg, plaintext, point, square, triangle)
  • style
  • sides
  • peripheries
  • fixedsize (h x w)
  • height
  • width
  • distortion
  • penwidth (width of shape border)
  • x (displacement left/right)
  • y (displacement up/down)
  • fontname
  • fontsize
  • icon

Edges - editable attributes

  • arrowsize
  • arrowhead (normal, box, crow, curve, diamond, dot, inv, none, tee, vee)
  • arrowtail
  • dir (direction: forward, back, both, none)
  • style (dashed, …)
  • color
  • alpha
  • headport (where on the head node the edge attaches, e.g. n, s, e, w)
  • tailport (where on the tail node the edge attaches, e.g. n, s, e, w)
  • fontname
  • fontsize
  • fontcolor
  • penwidth (width of arrow)
  • minlen (minimum length)

Color names: hexadecimal values or ‘X11’ color names, see here for X11 details

Complex examples

The example below expands on the surveillance_diagram, adding complex node names, grouped edges, colors and styling

DiagrammeR::grViz("               # All instructions are within a large character string
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            # layout top-to-bottom
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]                      
  
  Primary   [label = 'Primary\nFacility'] 
  Secondary [label = 'Secondary\nFacility'] 
  Tertiary  [label = 'Tertiary\nFacility'] 
  SC        [label = 'Surveillance\nCoordination',
             fontcolor = darkgreen] 
  
  # edges
  #######
  Primary   -> Secondary [label = ' case transfer',
                          fontcolor = red,
                          color = red]
  Secondary -> Tertiary [label = ' case transfer',
                          fontcolor = red,
                          color = red]
  
  # grouped edge
  {Primary Secondary Tertiary} -> SC [label = 'case reporting',
                                      fontcolor = darkgreen,
                                      color = darkgreen,
                                      style = dashed]
}
")

Sub-graph clusters

To group nodes into boxed clusters, put them within the same named subgraph (subgraph name {}). To have each subgraph identified within a bounding box, begin the name of the subgraph with “cluster”, as shown with the 4 boxes below.

DiagrammeR::grViz("             # All instructions are within a large character string
digraph surveillance_diagram {  # 'digraph' means 'directional graph', then the graph name 
  
  # graph statement
  #################
  graph [layout = dot,
         rankdir = TB,            
         overlap = true,
         fontsize = 10]
  

  # nodes (circles)
  #################
  node [shape = circle,                  # shape = circle
       fixedsize = true
       width = 1.3]                      # width of circles
  
  subgraph cluster_passive {
    Primary   [label = 'Primary\nFacility'] 
    Secondary [label = 'Secondary\nFacility'] 
    Tertiary  [label = 'Tertiary\nFacility'] 
    SC        [label = 'Surveillance\nCoordination',
               fontcolor = darkgreen] 
  }
  
  # nodes (boxes)
  ###############
  node [shape = box,                     # node shape
        fontname = Helvetica]            # text font in node
  
  subgraph cluster_active {
    Active [label = 'Active\nSurveillance'] 
    HCF_active [label = 'HCF\nActive Search']
  }
  
  subgraph cluster_EBD {
    EBS [label = 'Event-Based\nSurveillance (EBS)'] 
    'Social Media'
    Radio
  }
  
  subgraph cluster_CBS {
    CBS [label = 'Community-Based\nSurveillance (CBS)']
    RECOs
  }

  
  # edges
  #######
  {Primary Secondary Tertiary} -> SC [label = 'case reporting']

  Primary   -> Secondary [label = 'case transfer',
                          fontcolor = red]
  Secondary -> Tertiary [label = 'case transfer',
                          fontcolor = red]
  
  HCF_active -> Active
  
  {'Social Media' Radio} -> EBS
  
  RECOs -> CBS
}
")

Node shapes

The example below, borrowed from this tutorial, shows applied node shapes and a shorthand for serial edge connections

DiagrammeR::grViz("digraph {

graph [layout = dot, rankdir = LR]

# define the global styles of the nodes. We can override these in box if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]

data1 [label = 'Dataset 1', shape = folder, fillcolor = Beige]
data2 [label = 'Dataset 2', shape = folder, fillcolor = Beige]
process [label =  'Process \n Data']
statistical [label = 'Statistical \n Analysis']
results [label= 'Results']

# edge definitions with the node IDs
{data1 data2}  -> process -> statistical -> results
}")

Outputs

How to handle and save outputs

  • Outputs will appear in RStudio’s Viewer pane, by default in the lower-right alongside Files, Plots, Packages, and Help.
  • To export you can “Save as image” or “Copy to clipboard” from the Viewer. The graphic will adjust to the specified size.

Parameterized figures

Here is a quote from this tutorial: https://mikeyharper.uk/flowcharts-in-r-using-diagrammer/

“Parameterized figures: A great benefit of designing figures within R is that we are able to connect the figures directly with our analysis by reading R values directly into our flowcharts. For example, suppose you have created a filtering process which removes values after each stage of a process, you can have a figure show the number of values left in the dataset after each stage of your process. To do this we, you can use the @@X symbol directly within the figure, then refer to this in the footer of the plot using [X]:, where X is the a unique numeric index.”

We encourage you to review this tutorial if parameterization is something you are interested in.
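As a minimal sketch of the @@ substitution described in the quote, the example below reads two R values into node labels. The objects n_reported and n_confirmed and their values are hypothetical; in practice they might come from nrow() on your data at each filtering stage.

```r
library(DiagrammeR)

# hypothetical counts; in practice these could be e.g. nrow(linelist)
n_reported  <- 100
n_confirmed <- 60

param_flow <- grViz("
digraph filtering {

  graph [layout = dot, rankdir = LR]
  node [shape = rectangle]

  # @@1 and @@2 are replaced by the footnotes [1] and [2] below
  reported  [label = '@@1']
  confirmed [label = '@@2']

  reported -> confirmed
}

[1]: paste0('Reported cases (n = ', n_reported, ')')
[2]: paste0('Confirmed cases (n = ', n_confirmed, ')')
")

param_flow
```

Because the footnotes are R expressions, re-running the code after the data change should update the labels without editing the diagram itself.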

35.3 Alluvial/Sankey Diagrams

Load packages

We load the networkD3 package to produce the diagram, and also tidyverse for the data preparation steps.

pacman::p_load(
  networkD3,
  tidyverse)

Plotting from dataset

Below we demonstrate plotting the connections in a dataset with networkD3, using the case linelist. Here is an online tutorial.

We begin by getting the case counts for each unique age category and hospital combination. We’ve removed values with missing age category for clarity. We also re-label the hospital and age_cat columns as source and target respectively. These will be the two sides of the alluvial diagram.

# counts by hospital and age category
links <- linelist %>% 
  drop_na(age_cat) %>% 
  select(hospital, age_cat) %>%
  count(hospital, age_cat) %>% 
  rename(source = hospital,
         target = age_cat)

The dataset now looks like this:

Now we create a data frame of all the diagram nodes, under the column name. This consists of all the unique values of hospital and age_cat. Note that we ensure they are all class character before combining them (the ID columns are converted to numbers in the next step):

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

nodes  # print
##                                    name
## 1                      Central Hospital
## 2                     Military Hospital
## 3                               Missing
## 4                                 Other
## 5                         Port Hospital
## 6  St. Mark's Maternity Hospital (SMMH)
## 7                                   0-4
## 8                                   5-9
## 9                                 10-14
## 10                                15-19
## 11                                20-29
## 12                                30-49
## 13                                50-69
## 14                                  70+

Then we edit the links data frame, which we created above with count(). We add two numeric columns, IDsource and IDtarget, which will actually reflect/create the links between the nodes. These columns will hold the row numbers (positions) of the source and target nodes. 1 is subtracted so that these position numbers begin at 0 (not 1).

# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1
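To illustrate what match() returns here, consider this small base R sketch with made-up vectors (node_names, link_source, and link_target are hypothetical stand-ins for nodes$name, links$source, and links$target).

```r
# hypothetical node names, in the order they appear in the nodes data frame
node_names <- c("Central Hospital", "Port Hospital", "0-4", "5-9")

# hypothetical link endpoints, by name
link_source <- c("Port Hospital", "Central Hospital", "Port Hospital")
link_target <- c("0-4", "5-9", "5-9")

# position of each endpoint in node_names, shifted to begin at 0
id_source <- match(link_source, node_names) - 1
id_target <- match(link_target, node_names) - 1

id_source   # 1 0 1
id_target   # 2 3 3
```

The first node in node_names gets ID 0, which is the zero-based indexing that sankeyNetwork() expects.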

The links dataset now looks like this:

Now plot the Sankey diagram with sankeyNetwork(). You can read more about each argument by running ?sankeyNetwork in the console. Note that unless you set iterations = 0 the order of your nodes may not be as expected.

# plot
######
p <- sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "IDsource",
  Target = "IDtarget",
  Value = "n",
  NodeID = "name",
  units = "cases",       # unit label for the link values (here, case counts)
  fontSize = 12,
  nodeWidth = 30,
  iterations = 0)        # ensure node order is as in data
p

Here is an example where the patient outcome is included as well. Note that in the data preparation step we have to calculate the counts of cases between age and hospital, and separately between hospital and outcome - and then bind all these counts together with bind_rows().

# counts by hospital and age category
age_hosp_links <- linelist %>% 
  drop_na(age_cat) %>% 
  select(hospital, age_cat) %>%
  count(hospital, age_cat) %>% 
  rename(source = age_cat,          # re-name
         target = hospital)

hosp_out_links <- linelist %>% 
    drop_na(age_cat) %>% 
    select(hospital, outcome) %>% 
    count(hospital, outcome) %>% 
    rename(source = hospital,       # re-name
           target = outcome)

# combine links
links <- bind_rows(age_hosp_links, hosp_out_links)

# The unique node names
nodes <- data.frame(
  name=c(as.character(links$source), as.character(links$target)) %>% 
    unique()
  )

# Create id numbers
links$IDsource <- match(links$source, nodes$name)-1 
links$IDtarget <- match(links$target, nodes$name)-1

# plot
######
p <- sankeyNetwork(Links = links,
                   Nodes = nodes,
                   Source = "IDsource",
                   Target = "IDtarget",
                   Value = "n",
                   NodeID = "name",
                   units = "cases",
                   fontSize = 12,
                   nodeWidth = 30,
                   iterations = 0)
p

See also: https://www.displayr.com/sankey-diagrams-r/

35.4 Event timelines

To make a timeline showing specific events, you can use the vistime package.

See this vignette

# load package
pacman::p_load(vistime,  # make the timeline
               plotly    # for interactive visualization
               )

Here is the events dataset we begin with:
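The events table is not reproduced here. As a stand-in, a small hypothetical data frame using the column names that vistime() looks for by default (event, start, end, and optionally group) might look like the sketch below. All events and dates are invented for illustration.

```r
# hypothetical events data; column names follow the vistime() defaults
data <- data.frame(
  event = c("First case reported", "Outbreak declared",
            "Vaccination campaign", "Outbreak declared over"),
  group = c("Surveillance", "Response", "Response", "Response"),
  start = as.Date(c("2020-05-04", "2020-05-11", "2020-05-18", "2020-07-20")),
  end   = as.Date(c("2020-05-04", "2020-05-11", "2020-06-15", "2020-07-20")))

data
```

Events whose start and end are the same date should be drawn as points, and events spanning several days as ranges.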

p <- vistime(data)    # apply vistime to the events data frame

# step 1: transform into a list
pp <- plotly_build(p)

# step 2: marker size
for(i in seq_along(pp$x$data)){
  if(pp$x$data[[i]]$mode == "markers") pp$x$data[[i]]$marker$size <- 10
}

# step 3: text size
for(i in seq_along(pp$x$data)){
  if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textfont$size <- 10
}

# step 4: text position
for(i in seq_along(pp$x$data)){
  if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textposition <- "right"
}

#print
pp

35.5 DAGs

You can build a DAG manually using the DiagrammeR package and DOT language as described above.

Alternatively, there are packages like ggdag and dagitty.
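As a minimal sketch of the manual approach, the DOT code below draws a three-node DAG in which a confounder affects both the exposure and the outcome. The node names are generic placeholders, not taken from any dataset in this handbook.

```r
library(DiagrammeR)

dag <- grViz("
digraph dag {

  graph [layout = dot, rankdir = LR]

  # plain text nodes, without a surrounding shape
  node [shape = plaintext]

  Exposure; Outcome; Confounder

  # causal arrows
  Confounder -> Exposure
  Confounder -> Outcome
  Exposure   -> Outcome
}
")

dag
```

This only draws the diagram; it does not check the graph for cycles or identify adjustment sets.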

Introduction to DAGs ggdag vignette

Causal inference with dags in R

35.6 Resources

Much of the above regarding the DOT language is adapted from the tutorial at this site

Another more in-depth tutorial on DiagrammeR

This page on Sankey diagrams

36 Combinations analysis

This analysis plots the frequency of different combinations of values/responses. In this example, we plot the frequency at which cases exhibited various combinations of symptoms.

This analysis is also often called:

  • “Multiple response analysis”
  • “Sets analysis”
  • “Combinations analysis”

In the example plot above, five symptoms are shown. Below each vertical bar is a line and dots indicating the combination of symptoms reflected by the bar above. To the right, horizontal bars reflect the frequency of each individual symptom.

The first method we show uses the package ggupset, and the second uses the package UpSetR.

36.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,           # file import (provides import(), used below)
  tidyverse,     # data management and visualization
  UpSetR,        # special package for combination plots
  ggupset)       # special package for combination plots

Import data

To begin, we import the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist_sym <- import("linelist_cleaned.rds")

This linelist includes five “yes/no” variables on reported symptoms. We will need to transform these variables a bit to use the ggupset package to make our plot. View the data (scroll to the right to see the symptoms variables).

Re-format values

To align with the format expected by ggupset we convert the “yes” and “no” values to the actual symptom name, using case_when() from dplyr. If “no”, we set the value as missing, so the values are either NA or the symptom name.

# create column with the symptoms named, separated by semicolons
linelist_sym_1 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into the symptom name itself
  mutate(
    fever = case_when(
      fever == "yes" ~ "fever",          # if old value is "yes", new value is "fever"
      TRUE           ~ NA_character_),   # if old value is anything other than "yes", the new value is NA
         
    chills = case_when(
       chills == "yes" ~ "chills",
       TRUE           ~ NA_character_),
    
    cough = case_when(
      cough == "yes" ~ "cough",
      TRUE           ~ NA_character_),
         
    aches = case_when(
      aches == "yes" ~ "aches",
      TRUE           ~ NA_character_),
         
    vomit = case_when(
      vomit == "yes" ~ "vomit",
      TRUE           ~ NA_character_)
    )
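The five case_when() calls above follow the same pattern, so they could probably be condensed with across() and cur_column() from dplyr. The sketch below shows the idea on a small made-up data frame (toy_sym, hypothetical) with two symptom columns.

```r
library(dplyr)

# hypothetical yes/no symptom data, including a missing value
toy_sym <- tibble(
  fever = c("yes", "no", NA),
  cough = c("no", "yes", "yes"))

# replace "yes" with the name of the column; everything else becomes NA
toy_named <- toy_sym %>% 
  mutate(
    across(
      .cols = c(fever, cough),
      .fns  = ~ if_else(.x == "yes", cur_column(), NA_character_)))

toy_named
```

Writing each column out explicitly, as in the main example, is longer but may be easier to read and adapt.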

Now we make two final columns:

  1. Concatenating (gluing together) all the symptoms of the patient (a character column)
  2. Convert the above column to class list, so it can be accepted by ggupset to make the plot

See the page on Characters and strings to learn more about the unite() function from tidyr.

linelist_sym_1 <- linelist_sym_1 %>% 
  unite(col = "all_symptoms",
        c(fever, chills, cough, aches, vomit), 
        sep = "; ",
        remove = TRUE,
        na.rm = TRUE) %>% 
  mutate(
    # make a copy of all_symptoms column, but of class "list" (which is required to use ggupset() in next step)
    all_symptoms_list = as.list(strsplit(all_symptoms, "; "))
    )
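To see what these two steps produce, here is a minimal sketch on a made-up data frame (toy_sym, hypothetical) with two symptom columns and three patients: one with both symptoms, one with cough only, and one with neither.

```r
library(dplyr)
library(tidyr)

# hypothetical data after the re-formatting step (values are NA or the symptom name)
toy_sym <- tibble(
  fever = c("fever", NA, NA),
  cough = c("cough", "cough", NA))

toy_sym <- toy_sym %>% 
  # glue the symptoms together, dropping missing values
  unite(col = "all_symptoms", c(fever, cough), sep = "; ", na.rm = TRUE) %>% 
  # split the glued text back into a list column, one character vector per patient
  mutate(all_symptoms_list = strsplit(all_symptoms, "; "))

toy_sym$all_symptoms              # "fever; cough" "cough" ""
lengths(toy_sym$all_symptoms_list) # 2 1 0
```

A patient with no symptoms gets an empty string in all_symptoms and an empty character vector in the list column.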

View the new data. Note the two columns towards the right end - the pasted combined values, and the list

36.2 ggupset

Load the package

pacman::p_load(ggupset)

Create the plot. We begin with a ggplot() and geom_bar(), but then we add the special function scale_x_upset() from the ggupset package.

ggplot(
  data = linelist_sym_1,
  mapping = aes(x = all_symptoms_list)) +
geom_bar() +
scale_x_upset(
  reverse = FALSE,
  n_intersections = 10,
  sets = c("fever", "chills", "cough", "aches", "vomit"))+
labs(
  title = "Signs & symptoms",
  subtitle = "10 most frequent combinations of signs and symptoms",
  caption = "Caption here.",
  x = "Symptom combination",
  y = "Frequency in dataset")

More information on ggupset can be found online or offline in the package documentation in your RStudio Help tab ?ggupset.

36.3 UpSetR

The UpSetR package allows more customization of the plot, but it can be more difficult to execute:

Load package

pacman::p_load(UpSetR)

Data cleaning

We must convert the linelist symptoms values to 1 / 0.

# Prepare data for UpSetR

linelist_sym_2 <- linelist_sym %>% 
  
  # convert the "yes" and "no" values into 1 and 0
  mutate(
    fever = case_when(
      fever == "yes" ~ 1,    # if old value is "yes", new value is 1
      TRUE           ~ 0),   # if old value is anything other than "yes", the new value is 0
         
    chills = case_when(
      chills == "yes" ~ 1,
      TRUE           ~ 0),
         
    cough = case_when(
      cough == "yes" ~ 1,
      TRUE           ~ 0),
         
    aches = case_when(
      aches == "yes" ~ 1,
      TRUE           ~ 0),
         
    vomit = case_when(
      vomit == "yes" ~ 1,
      TRUE           ~ 0)
    )
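For a single column, the same recoding can be sketched in base R with %in%, which treats missing values as "not yes". The vector symptom_values below is made up for illustration.

```r
# hypothetical yes/no values, including a missing value
symptom_values <- c("yes", "no", NA, "yes")

# TRUE where the value is "yes", FALSE otherwise (including NA), then converted to 1 / 0
symptom_binary <- as.numeric(symptom_values %in% "yes")

symptom_binary   # 1 0 0 1
```

As with the case_when() version above, anything other than "yes" (including NA) becomes 0.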

Now make the plot using the function upset() from UpSetR - using only the symptoms columns. You must designate which “sets” to compare (the names of the symptom columns). Alternatively, use nsets = to show only the X largest sets, or use nintersects = together with order.by = "freq" to show only the top X combinations.

# Make the plot
UpSetR::upset(
  select(linelist_sym_2, fever, chills, cough, aches, vomit),
  sets = c("fever", "chills", "cough", "aches", "vomit"),
  order.by = "freq",
  sets.bar.color = c("blue", "red", "yellow", "darkgreen", "orange"), # optional colors
  empty.intersections = "on",
  # nsets = 3,
  number.angles = 0,
  point.size = 3.5,
  line.size = 2, 
  mainbar.y.label = "Symptoms Combinations",
  sets.x.label = "Patients with Symptom")

37 Transmission chains

37.1 Overview

The primary tool to handle, analyse and visualise transmission chains and contact tracing data is the package epicontacts, developed by the folks at RECON. Try out the interactive plot below by hovering over the nodes for more information, dragging them to move them and clicking on them to highlight downstream cases.

37.2 Preparation

Load packages

First load the standard packages required for data import and manipulation. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
   rio,          # File import
   here,         # File locator
   tidyverse,    # Data management + ggplot2 graphics
   remotes       # Package installation from github
)

You will require the development version of epicontacts, which can be installed from github using the p_install_gh() function from pacman. You only need to run the command below once, not every time you use the package (thereafter, you can use p_load() as usual).

pacman::p_install_gh("reconhub/epicontacts@timeline")

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow step-by-step, see instructions in the Download handbook and data page. The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

# import the linelist
linelist <- import("linelist_cleaned.xlsx")

The first 50 rows of the linelist are displayed below. Of particular interest are the columns case_id, generation, infector, and source.

Creating an epicontacts object

We then need to create an epicontacts object, which requires two types of data:

  • a linelist documenting cases where columns are variables and rows correspond to unique cases
  • a list of edges defining links between cases on the basis of their unique IDs (these can be contacts, transmission events, etc.)

As we already have a linelist, we just need to create a list of edges between cases, more specifically between their IDs. We can extract transmission links from the linelist by linking the infector column with the case_id column. At this point we can also add “edge properties”, by which we mean any variable describing the link between the two cases, not the cases themselves. For illustration, we will add a location variable describing the location of the transmission event, and a duration variable describing the duration of the contact in days.

In the code below, the dplyr function transmute is similar to mutate, except it only keeps the columns we have specified within the function. The drop_na function will filter out any rows where the specified columns have an NA value; in this case, we only want to keep the rows where the infector is known.

## generate contacts
contacts <- linelist %>%
  transmute(
    infector = infector,
    case_id = case_id,
    location = sample(c("Community", "Nosocomial"), n(), TRUE),
    duration = sample.int(10, n(), TRUE)
  ) %>%
  drop_na(infector)
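For readers who want to see the same logic without dplyr, here is a rough base R sketch on a hypothetical four-case linelist (all IDs and object names are invented). It also sets a seed, because edge properties drawn with sample() differ from run to run unless you do so:

```r
# Hypothetical four-case linelist; NA means the infector is unknown
demo_linelist <- data.frame(
  case_id  = c("a1", "b2", "c3", "d4"),
  infector = c(NA, "a1", "a1", NA),
  stringsAsFactors = FALSE
)

# Keep rows with a known infector, and only the two ID columns
known <- !is.na(demo_linelist$infector)
demo_contacts <- demo_linelist[known, c("infector", "case_id")]

# Fix the seed so the randomly drawn edge property is reproducible
set.seed(1)
demo_contacts$location <- sample(
  c("Community", "Nosocomial"), nrow(demo_contacts), replace = TRUE
)
demo_contacts
```

Only the two cases with a known infector remain, one row per transmission link.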

We can now create the epicontacts object using the make_epicontacts function. We need to specify which column in the linelist points to the unique case identifier, as well as which columns in the contacts point to the unique identifiers of the cases involved in each link. These links are directional in that infection is going from the infector to the case, so we need to specify the from and to arguments accordingly. We therefore also set the directed argument to TRUE, which will affect future operations.

## generate epicontacts object
epic <- make_epicontacts(
  linelist = linelist,
  contacts = contacts,
  id = "case_id",
  from = "infector",
  to = "case_id",
  directed = TRUE
)

Upon examining the epicontacts object, we can see that the case_id column in the linelist has been renamed to id, and the infector and case_id columns in the contacts have been renamed to from and to, respectively. This ensures consistency in subsequent handling, visualisation and analysis operations.

## view epicontacts object
epic
## 
## /// Epidemiological Contacts //
## 
##   // class: epicontacts
##   // 5,888 cases in linelist; 3,800 contacts; directed 
## 
##   // linelist
## 
## # A tibble: 5,888 x 30
##    id     generation date_infection date_onset date_hospitalis~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5 hospital     lon   lat infector
##    <chr>       <dbl> <date>         <date>     <date>           <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>    <chr>      <dbl> <dbl> <chr>   
##  1 5fe599          4 2014-05-08     2014-05-13 2014-05-15       NA           <NA>    m          2 years            2 0-4     0-4      Other      -13.2  8.47 f547d6  
##  2 8689b7          4 NA             2014-05-13 2014-05-14       2014-05-18   Recover f          3 years            3 0-4     0-4      Missing    -13.2  8.45 <NA>    
##  3 11f8ea          2 NA             2014-05-16 2014-05-18       2014-05-30   Recover m         56 years           56 50-69   55-59    St. Mark'~ -13.2  8.46 <NA>    
##  4 b8812a          3 2014-05-04     2014-05-18 2014-05-20       NA           <NA>    f         18 years           18 15-19   15-19    Port Hosp~ -13.2  8.48 f90f5f  
##  5 893f25          3 2014-05-18     2014-05-21 2014-05-22       2014-05-29   Recover m          3 years            3 0-4     0-4      Military ~ -13.2  8.46 11f8ea  
##  6 be99c8          3 2014-05-03     2014-05-22 2014-05-23       2014-05-24   Recover f         16 years           16 15-19   15-19    Port Hosp~ -13.2  8.46 aec8ec  
##  7 07e3e8          4 2014-05-22     2014-05-27 2014-05-29       2014-06-01   Recover f         16 years           16 15-19   15-19    Missing    -13.2  8.46 893f25  
##  8 369449          4 2014-05-28     2014-06-02 2014-06-03       2014-06-07   Death   f          0 years            0 0-4     0-4      Missing    -13.2  8.46 133ee7  
##  9 f393b4          4 NA             2014-06-05 2014-06-06       2014-06-18   Recover m         61 years           61 50-69   60-64    Missing    -13.2  8.46 <NA>    
## 10 1389ca          4 NA             2014-06-05 2014-06-07       2014-06-09   Death   f         27 years           27 20-29   25-29    Missing    -13.3  8.47 <NA>    
## # ... with 5,878 more rows, and 13 more variables: source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>, aches <chr>,
## #   vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>
## 
##   // contacts
## 
## # A tibble: 3,800 x 4
##    from   to     location   duration
##    <chr>  <chr>  <chr>         <int>
##  1 f547d6 5fe599 Community         5
##  2 f90f5f b8812a Nosocomial        4
##  3 11f8ea 893f25 Nosocomial       10
##  4 aec8ec be99c8 Nosocomial        8
##  5 893f25 07e3e8 Community         4
##  6 133ee7 369449 Community         4
##  7 996f3a 2978ac Nosocomial        2
##  8 133ee7 57a565 Community         5
##  9 37a6f6 fc15ef Nosocomial        8
## 10 9f6884 2eaa9a Nosocomial        9
## # ... with 3,790 more rows

37.3 Handling

Subsetting

The subset() method for epicontacts objects allows for, among other things, filtering of networks based on properties of the linelist (“node attributes”) and the contacts database (“edge attributes”). These values must be passed as named lists to the respective argument. For example, in the code below we keep only the male cases in the linelist that have an infection date between April and July 2014 (dates are specified as ranges), and only the transmission links that occurred in the hospital.

sub_attributes <- subset(
  epic,
  node_attribute = list(
    gender = "m",
    date_infection = as.Date(c("2014-04-01", "2014-07-01"))
  ), 
  edge_attribute = list(location = "Nosocomial")
)
sub_attributes
## 
## /// Epidemiological Contacts //
## 
##   // class: epicontacts
##   // 69 cases in linelist; 1,948 contacts; directed 
## 
##   // linelist
## 
## # A tibble: 69 x 30
##    id     generation date_infection date_onset date_hospitalis~ date_outcome outcome gender   age age_unit age_years age_cat age_cat5 hospital     lon   lat infector
##    <chr>       <dbl> <date>         <date>     <date>           <date>       <chr>   <chr>  <dbl> <chr>        <dbl> <fct>   <fct>    <chr>      <dbl> <dbl> <chr>   
##  1 5fe599          4 2014-05-08     2014-05-13 2014-05-15       NA           <NA>    m          2 years            2 0-4     0-4      Other      -13.2  8.47 f547d6  
##  2 893f25          3 2014-05-18     2014-05-21 2014-05-22       2014-05-29   Recover m          3 years            3 0-4     0-4      Military ~ -13.2  8.46 11f8ea  
##  3 2978ac          4 2014-05-30     2014-06-06 2014-06-08       2014-06-15   Death   m         12 years           12 10-14   10-14    Port Hosp~ -13.2  8.48 996f3a  
##  4 57a565          4 2014-05-28     2014-06-13 2014-06-15       NA           Death   m         42 years           42 30-49   40-44    Military ~ -13.3  8.46 133ee7  
##  5 fc15ef          6 2014-06-14     2014-06-16 2014-06-17       2014-07-09   Recover m         19 years           19 15-19   15-19    Missing    -13.2  8.48 37a6f6  
##  6 99e8fa          7 2014-06-24     2014-06-28 2014-06-29       2014-07-09   Recover m         19 years           19 15-19   15-19    Port Hosp~ -13.2  8.47 ab634e  
##  7 f327be          6 2014-06-14     2014-07-12 2014-07-13       2014-07-14   Death   m         31 years           31 30-49   30-34    St. Mark'~ -13.2  8.46 a15e13  
##  8 90e5fe          5 2014-06-18     2014-07-13 2014-07-14       2014-07-16   <NA>    m         67 years           67 50-69   65-69    Port Hosp~ -13.3  8.46 ea3740  
##  9 a47529          5 2014-06-13     2014-07-17 2014-07-18       2014-07-26   Death   m         45 years           45 30-49   45-49    Military ~ -13.2  8.48 a2086d  
## 10 da8ecb          5 2014-06-20     2014-07-18 2014-07-20       2014-08-01   <NA>    m         12 years           12 10-14   10-14    Missing    -13.2  8.48 eb2277  
## # ... with 59 more rows, and 13 more variables: source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>, cough <chr>, aches <chr>,
## #   vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>
## 
##   // contacts
## 
## # A tibble: 1,948 x 4
##    from   to     location   duration
##    <chr>  <chr>  <chr>         <int>
##  1 f90f5f b8812a Nosocomial        4
##  2 11f8ea 893f25 Nosocomial       10
##  3 aec8ec be99c8 Nosocomial        8
##  4 996f3a 2978ac Nosocomial        2
##  5 37a6f6 fc15ef Nosocomial        8
##  6 9f6884 2eaa9a Nosocomial        9
##  7 8e104d ddddee Nosocomial        8
##  8 a15e13 f327be Nosocomial        9
##  9 567136 8ebf6e Nosocomial        3
## 10 36e2e7 6d788e Nosocomial        5
## # ... with 1,938 more rows

We can use the thin function to either filter the linelist to include cases that are found in the contacts by setting the argument what = "linelist", or filter the contacts to include cases that are found in the linelist by setting the argument what = "contacts". In the code below, we further filter the epicontacts object to keep only the transmission links involving the male cases infected between April and July that we selected above. We can see that only five known transmission links fit that specification.

sub_attributes <- thin(sub_attributes, what = "contacts")
nrow(sub_attributes$contacts)
## [1] 5

In addition to subsetting by node and edge attributes, networks can be pruned to only include components that are connected to certain nodes. The cluster_id argument takes a vector of case IDs and returns the linelist of individuals that are linked, directly or indirectly, to those IDs. In the code below, we can see that a total of 13 linelist cases are involved in the clusters containing 2ae019 and 71577a.

sub_id <- subset(epic, cluster_id = c("2ae019","71577a"))
nrow(sub_id$linelist)
## [1] 13

The subset() method for epicontacts objects also allows filtering by cluster size using the cs, cs_min and cs_max arguments. In the code below, we are keeping only the cases linked to clusters of 10 cases or larger, and can see that 271 linelist cases are involved in such clusters.

sub_cs <- subset(epic, cs_min = 10)
nrow(sub_cs$linelist)
## [1] 271

Accessing IDs

The get_id() function retrieves information on case IDs in the dataset, and can be parameterized as follows:

  • linelist: IDs in the line list data
  • contacts: IDs in the contact dataset (“from” and “to” combined)
  • from: IDs in the “from” column of contact dataset
  • to: IDs in the “to” column of contact dataset
  • all: IDs that appear anywhere in either dataset
  • common: IDs that appear in both contacts dataset and line list

For example, what are the first ten IDs in the contacts dataset?

contacts_ids <- get_id(epic, "contacts")
head(contacts_ids, n = 10)
##  [1] "f547d6" "f90f5f" "11f8ea" "aec8ec" "893f25" "133ee7" "996f3a" "37a6f6" "9f6884" "4802b1"

How many IDs are found in both the linelist and the contacts?

length(get_id(epic, "common"))
## [1] 4352
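The options listed above are set operations on the two ID columns. The following base R sketch shows the idea on hypothetical IDs; it illustrates the meaning of each option and is not how epicontacts itself is implemented:

```r
# Hypothetical IDs: "d4" has no contacts, "z9" is not in the linelist
linelist_ids <- c("a1", "b2", "c3", "d4")
from_ids     <- c("a1", "a1", "z9")
to_ids       <- c("b2", "c3", "a1")

# "contacts": every ID that appears in either contact column
contacts_ids <- unique(c(from_ids, to_ids))

# "all": IDs that appear anywhere in either dataset
all_ids <- union(linelist_ids, contacts_ids)

# "common": IDs that appear in both the linelist and the contacts
common_ids <- intersect(linelist_ids, contacts_ids)

length(all_ids)
common_ids
```

A gap between the number of "contacts" IDs and "common" IDs, as in the real data above, points to contacts involving people with no linelist record.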

37.4 Visualization

Basic plotting

All visualisations of epicontacts objects are handled by the plot function. We will first filter the epicontacts object to include only the cases with onset dates in June 2014 using the subset function, and only include the contacts linked to those cases using the thin function.

## subset epicontacts object
sub <- epic %>%
  subset(
    node_attribute = list(date_onset = as.Date(c("2014-06-01", "2014-06-30")))
  ) %>%
  thin("contacts")

We can then create the basic, interactive plot very simply as follows:

## plot epicontacts object
plot(
  sub,
  width = 700,
  height = 700
)

You can move the nodes around by dragging them, hover over them for more information and click on them to highlight connected cases.

There are a large number of arguments to further modify this plot. We will cover the main ones here, but check out the documentation via ?vis_epicontacts (the function called when using plot on an epicontacts object) to get a full description of the function arguments.

Visualising node attributes

Node color, node shape and node size can be mapped to a given column in the linelist using the node_color, node_shape and node_size arguments. This is similar to the aes syntax you may recognise from ggplot2.

The specific colors, shapes and sizes of nodes can be specified as follows:

  • Colors via the col_pal argument, either by providing a named vector for manual specification of each color as done below, or by providing a color palette function such as colorRampPalette(c("black", "red", "orange")), which would provide a gradient of colours between the ones specified.

  • Shapes by passing a named vector to the shapes argument, specifying one shape for each unique element in the linelist column specified by the node_shape argument. See codeawesome for available shapes.

  • Size by passing a size range of the nodes to the size_range argument.
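As a small aside on the palette option mentioned above: colorRampPalette() from base R returns a function, and calling that function with a number gives that many interpolated colours. A minimal sketch (pal and five_cols are names chosen here for illustration):

```r
# Build a palette function that interpolates between three colours
pal <- colorRampPalette(c("black", "red", "orange"))

# Ask the palette function for five colours along the gradient
five_cols <- pal(5)
five_cols
```

It is the function itself (pal), not the vector of colours, that would be passed to col_pal in this case.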

Here is an example, where color represents the outcome, shape the gender and size the age:

plot(
  sub, 
  node_color = "outcome",
  node_shape = "gender",
  node_size = 'age',
  col_pal = c(Death = "firebrick", Recover = "green"),
  shapes = c(f = "female", m = "male"),
  size_range = c(40, 60),
  height = 700,
  width = 700
)

Visualising edge attributes

Edge color, width and linetype can be mapped to a given column in the contacts dataframe using the edge_color, edge_width and edge_linetype arguments. The specific colors and widths of the edges can be specified as follows:

  • Colors via the edge_col_pal argument, in the same manner used for col_pal.

  • Widths by passing a size range of the edges to the width_range argument.

Here is an example:

plot(
  sub, 
  node_color = "outcome",
  node_shape = "gender",
  node_size = 'age',
  col_pal = c(Death = "firebrick", Recover = "green"),
  shapes = c(f = "female", m = "male"),
  size_range = c(40, 60),
  edge_color = 'location',
  edge_linetype = 'location',
  edge_width = 'duration',
  edge_col_pal = c(Community = "orange", Nosocomial = "purple"),
  width_range = c(1, 3),
  height = 700,
  width = 700
)

Temporal axis

We can also visualise the network along a temporal axis by mapping the x_axis argument to a column in the linelist. In the example below, the x-axis represents the date of symptom onset. We have also specified the arrow_size argument to ensure the arrows are not too large, and set label = FALSE to make the figure less cluttered.

plot(
  sub,
  x_axis = "date_onset",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

There are a large number of additional arguments to further specify how this network is visualised along a temporal axis, which you can check out via ?vis_temporal_interactive (the function called when using plot on an epicontacts object with x_axis specified). We’ll go through some below.

Specifying transmission tree shape

There are two main shapes that the transmission tree can assume, specified using the network_shape argument. The first is a branching shape as shown above, where a straight edge connects any two nodes. This is the most intuitive representation, but it can result in overlapping edges in a densely connected network. The second shape is rectangle, which produces a tree resembling a phylogeny. For example:

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

Each case node can be assigned a unique vertical position by toggling the position_dodge argument. The position of unconnected cases (i.e. with no reported contacts) is specified using the unlinked_pos argument.

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  position_dodge = TRUE,
  unlinked_pos = "bottom",
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

The position of the parent node relative to the children nodes can be specified using the parent_pos argument. The default option is to place the parent node in the middle, however it can be placed at the bottom (parent_pos = 'bottom') or at the top (parent_pos = 'top').

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  parent_pos = "top",
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
)

Saving plots and figures

You can save a plot as an interactive, self-contained html file with the visSave function from the visNetwork package:

plot(
  sub,
  x_axis = "date_onset",
  network_shape = "rectangle",
  node_color = "outcome",
  col_pal = c(Death = "firebrick", Recover = "green"),
  parent_pos = "top",
  arrow_size = 0.5,
  node_size = 13,
  label = FALSE,
  height = 700,
  width = 700
) %>%
  visNetwork::visSave("network.html")

Saving these network outputs as an image is unfortunately less easy and requires you to save the file as an html and then take a screenshot of this file using the webshot package. In the code below, we are converting the html file saved above into a PNG:

webshot::webshot(url = "network.html", file = "network.png")

Timelines

You can also add case timelines to the network, which are represented on the x-axis of each case. This can be used to visualise case locations, for example, or time to outcome. To generate a timeline, we need to create a data.frame of at least three columns indicating the case ID, the start date of the “event” and the end date of the “event”. You can also add any number of other columns, which can then be mapped to node and edge properties of the timeline. In the code below, we generate a timeline going from the date of symptom onset to the date of outcome, and keep the outcome and hospital variables, which we use to define the node shape and colour. Note that you can have more than one timeline row/event per case, for example if a case is transferred between multiple hospitals.

## generate timeline
timeline <- linelist %>%
  transmute(
    id = case_id,
    start = date_onset,
    end = date_outcome,
    outcome = outcome,
    hospital = hospital
  )

We then pass the timeline element to the timeline argument. We can map timeline attributes to timeline node colours, shapes and sizes in the same way defined in previous sections, except that we have two nodes: the start and end node of each timeline, which have separate arguments. For example, tl_start_node_color defines which timeline column is mapped to the colour of the start node, while tl_end_node_shape defines which timeline column is mapped to the shape of the end node. We can also map colour, width, linetype and labels to the timeline edge via the tl_edge_* arguments.

See ?vis_temporal_interactive (the function called when plotting an epicontacts object) for detailed documentation on the arguments. Each argument is annotated in the code below too:

## define shapes
shapes <- c(
  f = "female",
  m = "male",
  Death = "user-times",
  Recover = "heartbeat",
  "NA" = "question-circle"
)

## define colours
colours <- c(
  Death = "firebrick",
  Recover = "green",
  "NA" = "grey"
)

## make plot
plot(
  sub,
  ## map x coordinate to date of onset
  x_axis = "date_onset",
  ## use rectangular network shape
  network_shape = "rectangle",
  ## map case node shapes to gender column
  node_shape = "gender",
  ## we don't want to map node colour to any columns - this is important as the
  ## default value is to map to node id, which will mess up the colour scheme
  node_color = NULL,
  ## set case node size to 30 (as this is not a character, node_size is not
  ## mapped to a column but instead interpreted as the actual node size)
  node_size = 30,
  ## set transmission link width to 4 (as this is not a character, edge_width is
  ## not mapped to a column but instead interpreted as the actual edge width)
  edge_width = 4,
  ## provide the timeline object
  timeline = timeline,
  ## map the shape of the end node to the outcome column in the timeline object
  tl_end_node_shape = "outcome",
  ## set the size of the end node to 15 (as this is not a character, this
  ## argument is not mapped to a column but instead interpreted as the actual
  ## node size)
  tl_end_node_size = 15,
  ## map the colour of the timeline edge to the hospital column
  tl_edge_color = "hospital",
  ## set the width of the timeline edge to 2 (as this is not a character, this
  ## argument is not mapped to a column but instead interpreted as the actual
  ## edge width)
  tl_edge_width = 2,
  ## map edge labels to the hospital variable
  tl_edge_label = "hospital",
  ## specify the shape for every node attribute (defined above)
  shapes = shapes,
  ## specify the colour palette (defined above)
  col_pal = colours,
  ## set the size of the arrow to 0.5
  arrow_size = 0.5,
  ## use two columns in the legend
  legend_ncol = 2,
  ## set font size
  font_size = 15,
  ## define formatting for dates
  date_labels = c("%d %b %Y"),
  ## don't plot the ID labels below nodes
  label = FALSE,
  ## specify height
  height = 1000,
  ## specify width
  width = 1200,
  ## ensure each case node has a unique y-coordinate - this is very important
  ## when using timelines, otherwise you will have overlapping timelines from
  ## different cases
  position_dodge = TRUE
)
## Warning in assert_timeline(timeline, x, x_axis): 5865 timeline row(s) removed as ID not found in linelist or start/end date is NA

37.5 Analysis

Summarising

We can get an overview of some of the network properties using the summary function.

## summarise epicontacts object
summary(epic)
## 
## /// Overview //
##   // number of unique IDs in linelist: 5888
##   // number of unique IDs in contacts: 5511
##   // number of unique IDs in both: 4352
##   // number of contacts: 3800
##   // contacts with both cases in linelist: 56.868 %
## 
## /// Degrees of the network //
##   // in-degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.5392  1.0000  1.0000 
## 
##   // out-degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5392  1.0000  6.0000 
## 
##   // in and out degree summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.078   1.000   7.000 
## 
## /// Attributes //
##   // attributes in linelist:
##  generation date_infection date_onset date_hospitalisation date_outcome outcome gender age age_unit age_years age_cat age_cat5 hospital lon lat infector source wt_kg ht_cm ct_blood fever chills cough aches vomit temp time_admission bmi days_onset_hosp
## 
##   // attributes in contacts:
##  location duration

For example, we can see that only 57% of contacts have both cases in the linelist; this means that we do not have linelist data on a significant number of cases involved in these transmission chains.

Pairwise characteristics

The get_pairwise() function allows processing of variable(s) in the line list according to each pair in the contact dataset. For the following example, date of onset of disease is extracted from the line list in order to compute the difference between disease date of onset for each pair. The value that is produced from this comparison represents the serial interval (si).

si <- get_pairwise(epic, "date_onset")   
summary(si)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00    9.00   11.01   15.00   99.00    1820
tibble(si = si) %>%
  ggplot(aes(si)) +
  geom_histogram() +
  labs(
    x = "Serial interval",
    y = "Frequency"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1820 rows containing non-finite values (stat_bin).
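To make the pairwise idea concrete, here is a base R sketch on hypothetical data that looks up the onset date at each end of a link and takes the difference. It assumes the interval is the case's onset minus the infector's onset. The minimum of 0 in the summary above suggests the package may instead return absolute differences, so treat the sign convention here as an assumption:

```r
# Hypothetical linelist and contacts; "x0" has no linelist record
demo_linelist <- data.frame(
  id = c("a1", "b2", "c3"),
  date_onset = as.Date(c("2014-05-01", "2014-05-08", "2014-05-13")),
  stringsAsFactors = FALSE
)
demo_contacts <- data.frame(
  from = c("a1", "a1", "x0"),
  to   = c("b2", "c3", "a1"),
  stringsAsFactors = FALSE
)

# Look up the onset date of the infector and of the case for each link
onset_from <- demo_linelist$date_onset[match(demo_contacts$from, demo_linelist$id)]
onset_to   <- demo_linelist$date_onset[match(demo_contacts$to, demo_linelist$id)]

# Difference in days; NA when either case is missing from the linelist
si_demo <- as.numeric(onset_to - onset_from)
si_demo
```

The NA for the third link shows why the real serial interval vector contains so many missing values: a pair yields a value only when both cases have a known onset date in the linelist.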

The get_pairwise() function interprets the class of the column being used for comparison, and adjusts its method of comparing the values accordingly. For numbers and dates (like the si example above), the function will subtract the values. When applied to columns that are characters or categorical, get_pairwise() will paste values together. Because the function also allows for arbitrary processing (see the “f” argument), these discrete combinations can be easily tabulated and analyzed.

head(get_pairwise(epic, "gender"), n = 10)
##  [1] "f -> m" NA       "m -> m" NA       "m -> f" "f -> f" NA       "f -> m" NA       "m -> f"
get_pairwise(epic, "gender", f = table)
##            values.to
## values.from   f   m
##           f 464 516
##           m 510 468
fisher.test(get_pairwise(epic, "gender", f = table))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  get_pairwise(epic, "gender", f = table)
## p-value = 0.03758
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.6882761 0.9892811
## sample estimates:
## odds ratio 
##  0.8252575

Here, the test suggests a statistically significant association between the gender of the infector and the gender of the case (p = 0.038).
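To see which way this association runs, the counts printed above can be re-entered by hand. This is a sketch using only the table shown in the output (gender_pairs and same_gender are names chosen here):

```r
# Counts copied from the get_pairwise() table above (rows = from, cols = to)
gender_pairs <- matrix(
  c(464, 510, 516, 468),
  nrow = 2,
  dimnames = list(from = c("f", "m"), to = c("f", "m"))
)

# Share of transmission pairs in which both cases have the same gender
same_gender <- sum(diag(gender_pairs)) / sum(gender_pairs)
same_gender

# Same test as above, run directly on the matrix of counts
res <- fisher.test(gender_pairs)
res$estimate
res$p.value
```

An odds ratio below 1 corresponds to slightly fewer same-gender pairs than mixed-gender pairs in this simulated dataset.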

Identifying clusters

The get_clusters() function can be used to identify connected components in an epicontacts object. First, we use it to retrieve a data.frame containing the cluster information:

clust <- get_clusters(epic, output = "data.frame")
table(clust$cluster_size)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
## 1536 1680 1182  784  545  342  308  208  171  100   99   24   26   42
ggplot(clust, aes(cluster_size)) +
  geom_bar() +
  labs(
    x = "Cluster size",
    y = "Frequency"
  )

Let us look at the largest clusters. For this, we add cluster information to the epicontacts object and then subset it to keep only the largest clusters:

epic <- get_clusters(epic)
max_size <- max(epic$linelist$cluster_size)
plot(subset(epic, cs = max_size))
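If the notion of a connected component is unfamiliar, the base R sketch below labels the clusters of a small hypothetical network by repeatedly merging the labels at the two ends of each edge. This is only an illustration of the concept, with invented IDs, and makes no claim about how get_clusters() works internally:

```r
# Hypothetical edge list and case IDs; "h" has no contacts
edges <- data.frame(
  from = c("a", "a", "d", "f"),
  to   = c("b", "c", "e", "g"),
  stringsAsFactors = FALSE
)
ids <- c("a", "b", "c", "d", "e", "f", "g", "h")

# Start with every case in its own cluster, then give both ends of each
# edge the smaller of their two labels until nothing changes
cluster <- setNames(seq_along(ids), ids)
repeat {
  old <- cluster
  for (i in seq_len(nrow(edges))) {
    pair <- c(edges$from[i], edges$to[i])
    cluster[pair] <- min(cluster[pair])
  }
  if (identical(cluster, old)) break
}

# Size of the cluster that each case belongs to
cluster_size <- as.vector(table(cluster)[as.character(cluster)])
data.frame(id = ids, cluster = unname(cluster), cluster_size)
```

Cases a, b and c form one cluster of size 3, while the unlinked case h forms a cluster of size 1. Tabulating cluster_size in this per-case form counts each cluster once for every member, which is worth keeping in mind if clust$cluster_size above is also recorded once per case.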

Calculating degrees

The degree of a node corresponds to its number of edges or connections to other nodes. get_degree() provides an easy method for calculating this value for epicontacts networks. A high degree in this context indicates an individual who was in contact with many others. Setting type = "both" counts both the in-degree and the out-degree, and only_linelist = TRUE restricts the calculation to cases in the linelist.

deg_both <- get_degree(epic, type = "both", only_linelist = TRUE)

Which individuals have the ten most contacts?

head(sort(deg_both, decreasing = TRUE), 10)
## 916d0a 858426 6833d7 f093ea 11f8ea 3a4372 38fc71 c8c4d5 a127a7 02d8fd 
##      7      6      6      6      5      5      5      5      5      5

What is the mean number of contacts?

mean(deg_both)
## [1] 1.078473
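The degree calculation can be reproduced by hand on a small hypothetical network, which may help when interpreting the numbers above. In this base R sketch (all IDs and object names are invented), the out-degree counts how often a case appears as infector and the in-degree how often it appears as infectee:

```r
# Hypothetical contacts and linelist IDs; "e5" has no contacts
demo_contacts <- data.frame(
  from = c("a1", "a1", "b2"),
  to   = c("b2", "c3", "d4"),
  stringsAsFactors = FALSE
)
ids <- c("a1", "b2", "c3", "d4", "e5")

# Count appearances in each column, keeping zero counts for every linelist ID
out_degree <- table(factor(demo_contacts$from, levels = ids))
in_degree  <- table(factor(demo_contacts$to, levels = ids))

# Total degree per case, and the mean across the linelist
deg_demo <- setNames(as.vector(out_degree + in_degree), ids)
deg_demo
mean(deg_demo)
```

Cases without any recorded contact count as zero, which pulls the mean down. Note also that each link adds to the degree of both cases it connects.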

37.6 Resources

The epicontacts page provides an overview of the package functions and includes some more in-depth vignettes.

The github page can be used to raise issues and request features.

38 Phylogenetic trees

38.1 Overview

Phylogenetic trees are used to visualize and describe the relatedness and evolution of organisms based on the sequence of their genetic code.

They can be constructed from genetic sequences using distance-based methods (such as the neighbor-joining method) or character-based methods (such as maximum likelihood and Bayesian Markov Chain Monte Carlo methods). Next-generation sequencing (NGS) has become more affordable and is becoming more widely used in public health to describe pathogens causing infectious diseases. Portable sequencing devices decrease the turnaround time and hold promise for making data available to support outbreak investigations in real time. NGS data can be used to identify the origin or source of an outbreak strain and its propagation, as well as to determine the presence of antimicrobial resistance genes. To visualize the genetic relatedness between samples, a phylogenetic tree is constructed.

On this page we will learn how to use the ggtree package, which allows for combined visualization of phylogenetic trees with additional sample data in the form of a data frame. This will enable us to observe patterns and improve understanding of the outbreak dynamic.

38.2 Preparation

Load packages

This code chunk shows the loading of required packages. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,             # import/export
  here,            # relative file paths
  tidyverse,       # general data management and visualization
  ape,             # to import and export phylogenetic files
  ggtree,          # to visualize phylogenetic files
  treeio,          # to import and handle phylogenetic tree data
  ggnewscale)      # to add additional layers of color schemes

Import data

The data for this page can be downloaded with the instructions on the Download handbook and data page.

There are several different formats in which a phylogenetic tree can be stored (e.g. Newick, NEXUS, Phylip). A common one is the Newick file format (.nwk), which is the standard for representing trees in computer-readable form. This means an entire tree can be expressed in a string format such as “((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);”, listing all nodes and tips and their relationship (branch length) to each other.
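To get a feel for how the Newick string above encodes a tree, the base R sketch below pulls out the tip labels and counts the internal nodes with simple pattern matching. This is for illustration only and works for this simple string. In practice a parser such as ape::read.tree(text = ...) should be used.

```r
# The Newick string quoted in the text
newick <- "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);"

# Tip labels: names that are immediately followed by ":" (a branch length)
tip_matches <- gregexpr("[A-Za-z][A-Za-z0-9_]*(?=:)", newick, perl = TRUE)
tips <- regmatches(newick, tip_matches)[[1]]

# Each "(" opens one internal node (including the root)
n_internal <- lengths(regmatches(newick, gregexpr("(", newick, fixed = TRUE)))

tips
n_internal
```

The string describes five tips joined by four internal nodes, with the number after each colon giving the length of the branch leading to that tip or node.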

Note: It is important to understand that the phylogenetic tree file in itself does not contain sequencing data, but is merely the result of the genetic distances between the sequences. We therefore cannot extract sequencing data from a tree file.

First, we use the read.tree() function from the ape package to import a Newick phylogenetic tree file in .txt format, and store it in a list object of class “phylo”. If necessary, use the here() function from the here package to specify the relative file path.

Note: In this case the newick tree is saved as a .txt file for easier handling and downloading from Github.

tree <- ape::read.tree("Shigella_tree.txt")

We inspect our tree object and see it contains 299 tips (or samples) and 236 internal nodes.

tree
## 
## Phylogenetic tree with 299 tips and 236 internal nodes.
## 
## Tip labels:
##   SRR5006072, SRR4192106, S18BD07865, S18BD00489, S17BD08906, S17BD05939, ...
## Node labels:
##   17, 29, 100, 67, 100, 100, ...
## 
## Rooted; includes branch lengths.

Second, we import a table stored as a .csv file with additional information for each sequenced sample, such as gender, country of origin and attributes for antimicrobial resistance, using the import() function from the rio package:

sample_data <- import("sample_data_Shigella_tree.csv")

Below are the first 50 rows of the data:

Clean and inspect

We clean and inspect our data: In order to assign the correct sample data to the phylogenetic tree, the values in the column Sample_ID in the sample_data data frame need to match the tip.labels values in the tree file:

We check the formatting of the tip.labels in the tree file by looking at the first 6 entries using head() from base R.

head(tree$tip.label) 
## [1] "SRR5006072" "SRR4192106" "S18BD07865" "S18BD00489" "S17BD08906" "S17BD05939"

We also make sure the first column in our sample_data data frame is Sample_ID. We look at the column names of our dataframe using colnames() from base R.

colnames(sample_data)   
##  [1] "Sample_ID"                  "serotype"                   "Country"                    "Continent"                  "Travel_history"            
##  [6] "Year"                       "Belgium"                    "Source"                     "Gender"                     "gyrA_mutations"            
## [11] "macrolide_resistance_genes" "MIC_AZM"                    "MIC_CIP"

We look at the Sample_IDs in the data frame to make sure the formatting is the same as in the tip.label (e.g. letters are all capitals, no extra underscores _ between letters and numbers, etc.)

head(sample_data$Sample_ID) # we again inspect only the first 6 using head()
## [1] "S17BD05944" "S15BD07413" "S18BD07247" "S19BD07384" "S18BD07338" "S18BD02657"

We can also compare if all samples are present in the tree file and vice versa by generating a logical vector of TRUE or FALSE where they do or do not match. These are not printed here, for simplicity.

sample_data$Sample_ID %in% tree$tip.label

tree$tip.label %in% sample_data$Sample_ID

We can use these vectors to show any sample IDs that are not on the tree (there are none).

sample_data$Sample_ID[!sample_data$Sample_ID %in% tree$tip.label]
## character(0)

Upon inspection we can see that the format of Sample_ID in the dataframe corresponds to the format of sample names at the tip.labels. These do not have to be sorted in the same order to be matched.
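
When the IDs do not match, it helps to list the mismatches in both directions. The sketch below uses setdiff() from base R on two small made-up vectors (the IDs are illustrative and are not taken from the Shigella data), and shows how a simple case difference can be repaired with toupper().

```r
# Toy ID vectors (hypothetical values, for illustration only)
data_ids <- c("S17BD05944", "S15BD07413", "s18bd07247")   # note the lower-case ID
tip_ids  <- c("S15BD07413", "S18BD07247", "S17BD05944", "SRR5006072")

setdiff(data_ids, tip_ids)   # IDs in the data that are missing from the tree
setdiff(tip_ids, data_ids)   # tips that have no row in the data

# Harmonise the case, then check again
data_ids_clean <- toupper(data_ids)
setdiff(data_ids_clean, tip_ids)    # now empty
all(data_ids_clean %in% tip_ids)    # every data ID is found on the tree
```

Tips without a matching row (here "SRR5006072") are not an error in themselves, but they will have no sample data attached when the data frame is joined to the tree.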

We are ready to go!

38.3 Simple tree visualization

Different tree layouts

ggtree offers many different layout formats and some may be more suitable for your specific purpose than others. Below are a few demonstrations. For other options see this online book.

Here are some example tree layouts:

ggtree(tree)                                            # simple linear tree
ggtree(tree,  branch.length = "none")                   # simple linear tree with all tips aligned
ggtree(tree, layout="circular")                         # simple circular tree
ggtree(tree, layout="circular", branch.length = "none") # simple circular tree with all tips aligned

Simple tree plus sample data

The %<+% operator is used to connect the sample_data data frame to the tree file. The simplest annotation of your tree is the addition of the sample names at the tips, as well as coloring of the tip points and, if desired, the branches:

Here is an example of a circular tree:

ggtree(tree, layout = "circular", branch.length = 'none') %<+% sample_data + # %<+% adds dataframe with sample data to tree
  aes(color = I(Belgium))+                       # color the branches according to a variable in your dataframe
  scale_color_manual(
    name = "Sample Origin",                      # name of your color scheme (will show up in the legend like this)
    breaks = c("Yes", "No"),                     # the different options in your variable
    labels = c("NRCSS Belgium", "Other"),        # how you want the different options named in your legend, allows for formatting
    values = c("blue", "black"),                  # the color you want to assign to the variable 
    na.value = "black") +                        # color NA values in black as well
  new_scale_color()+                             # allows to add an additional color scheme for another variable
    geom_tippoint(
      mapping = aes(color = Continent),          # tip color by continent. You may change shape adding "shape = "
      size = 1.5)+                               # define the size of the point at the tip
  scale_color_brewer(
    name = "Continent",                    # name of your color scheme (will show up in the legend like this)
    palette = "Set1",                      # we choose a palette from the RColorBrewer package
    na.value = "grey") +                    # for the NA values we choose the color grey
  geom_tiplab(                             # adds name of sample to tip of its branch 
    color = 'black',                       # (add as many text lines as you wish with + , but you may need to adjust offset value to place them next to each other)
    offset = 1,
    size = 1,
    geom = "text",
    align = TRUE)+    
  ggtitle("Phylogenetic tree of Shigella sonnei")+       # title of your graph
  theme(
    axis.title.x = element_blank(), # removes x-axis title
    axis.title.y = element_blank(), # removes y-axis title
    legend.title = element_text(    # defines font size and format of the legend title
      face = "bold",
      size = 12),   
    legend.text=element_text(       # defines font size and format of the legend text
      face = "bold",
      size = 10),  
    plot.title = element_text(      # defines font size and format of the plot title
      size = 12,
      face = "bold"),  
    legend.position = "bottom",     # defines placement of the legend
    legend.box = "vertical",        # defines placement of the legend
    legend.margin = margin())   

You can export your tree plot with ggsave() as you would any other ggplot object. Written this way, ggsave() saves the last image produced to the file path you specify. Remember that you can use here() and relative file paths to easily save in subfolders, etc.

ggsave("example_tree_circular_1.png", width = 12, height = 14)
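
Because ggsave() defaults to the last plot displayed, it is easy to save the wrong image in a long script. A safer pattern, sketched below, is to store the plot in an object and pass it to the plot = argument. The plot and the output path here are placeholders of our own; in practice example_plot would be your stored ggtree object.

```r
library(ggplot2)

# Placeholder plot; in practice this would be your stored ggtree object
example_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Hypothetical output path in a temporary folder
out_file <- file.path(tempdir(), "example_plot.png")

ggsave(
  filename = out_file,
  plot     = example_plot,   # save this specific plot, not the last one printed
  width    = 12,
  height   = 14,
  units    = "cm")

file.exists(out_file)        # confirm the file was written
```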

38.4 Tree manipulation

Sometimes you may have a very large phylogenetic tree and be interested in only one part of it. For example, you may have produced a tree that includes historical or international samples to get a broad overview of where your dataset fits in the bigger picture. To look more closely at your own data, you then want to inspect only that portion of the larger tree.

Since the phylogenetic tree file is just the output of sequencing data analysis, we cannot manipulate the order of the nodes and branches in the file itself. These have already been determined in a previous analysis of the raw NGS data. We can, however, zoom in to parts of the tree, hide parts, and even subset part of the tree.

Zoom in

If you don’t want to “cut” your tree, but only inspect part of it more closely you can zoom in to view a specific part.

First, we plot the entire tree in linear format and add numeric labels to each node in the tree.

p <- ggtree(tree) %<+% sample_data +
  geom_tiplab(size = 1.5) +                # labels the tips of all branches with the sample name in the tree file
  geom_text2(
    mapping = aes(subset = !isTip,
                  label = node),
    size = 5,
    color = "darkred",
    hjust = 1,
    vjust = 1)                            # labels all the nodes in the tree

p  # print

To zoom in to one particular branch (sticking out to the right), use viewClade() on the ggtree object p and provide the node number to get a closer look:

viewClade(p, node = 452)

Collapsing branches

However, we may want to ignore this branch and can collapse it at that same node (node nr. 452) using collapse(). This tree is defined as p_collapsed.

p_collapsed <- collapse(p, node = 452)
p_collapsed

For clarity, when we print p_collapsed, we add a geom_point2() (a blue diamond) at the node of the collapsed branch.

p_collapsed + 
geom_point2(aes(subset = (node == 452)),  # we assign a symbol to the collapsed node
            size = 5,                     # define the size of the symbol
            shape = 23,                   # define the shape of the symbol
            fill = "steelblue")           # define the color of the symbol

Subsetting a tree

If we want to make a more permanent change and create a new, reduced tree to work with, we can subset part of it with tree_subset(). Then you can save it as a new Newick tree file or .txt file.

First, we inspect the tree nodes and tip labels in order to decide what to subset.

ggtree(
  tree,
  branch.length = 'none',
  layout = 'circular') %<+% sample_data +               # we add the sample data using the %<+% operator
  geom_tiplab(size = 1)+                                # label tips of all branches with sample name in tree file
  geom_text2(
    mapping = aes(subset = !isTip, label = node),
    size = 3,
    color = "darkred") +                                # labels all the nodes in the tree
 theme(
   legend.position = "none",                            # removes the legend altogether
   axis.title.x = element_blank(),
   axis.title.y = element_blank(),
   plot.title = element_text(size = 12, face="bold"))

Now, say we have decided to subset the tree at node 528 (keep only tips within this branch after node 528) and we save it as a new sub_tree1 object:

sub_tree1 <- tree_subset(
  tree,
  node = 528)                                            # we subset the tree at node 528

Let's have a look at the subset tree 1:

ggtree(sub_tree1) +
  geom_tiplab(size = 3) +
  ggtitle("Subset tree 1")
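
Reading node numbers off a dense plot can be difficult. As an alternative sketch, the node number can be looked up from two tip labels with getMRCA() from ape, which returns the node of their most recent common ancestor. Note that this example uses a small toy tree and extract.clade() from ape rather than tree_subset(), and the object names are ours.

```r
# Toy tree (not the Shigella tree)
toy_tree <- ape::read.tree(
  text = "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);")

# Node number of the most recent common ancestor of two tips
mrca_node <- ape::getMRCA(toy_tree, tip = c("t5", "t3"))

# Keep only the clade descending from that node
toy_subtree <- ape::extract.clade(toy_tree, node = mrca_node)

toy_subtree$tip.label   # tips retained in the subset
```

The node number found this way should also be usable as the node = argument of tree_subset(), since both functions refer to the node numbering of the same "phylo" object.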

You can also subset based on one particular sample, specifying how many nodes “backwards” you want to include. Let’s subset the same part of the tree based on a sample, in this case S17BD07692, going back 9 nodes and we save it as a new sub_tree2 object:

sub_tree2 <- tree_subset(
  tree,
  "S17BD07692",
  levels_back = 9) # levels back defines how many nodes backwards from the sample tip you want to go

Let's have a look at the subset tree 2:

ggtree(sub_tree2) +
  geom_tiplab(size =3)  +
  ggtitle("Subset tree 2")

You can also save your new tree either as a Newick file or as a text file using the write.tree() function from the ape package:

# to save in .nwk format
ape::write.tree(sub_tree2, file='data/phylo/Shigella_subtree_2.nwk')

# to save in .txt format
ape::write.tree(sub_tree2, file='data/phylo/Shigella_subtree_2.txt')
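
To check that a saved tree can be read back unchanged, a quick round trip can help. The sketch below writes a toy tree to a temporary file (the file path and object names are ours) and re-imports it with read.tree().

```r
# Toy tree (not the Shigella tree)
toy_tree <- ape::read.tree(
  text = "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);")

# Write to a temporary file, then read it back in
tmp_file <- tempfile(fileext = ".nwk")
ape::write.tree(toy_tree, file = tmp_file)

readLines(tmp_file)                          # the file holds one Newick string
tree_reloaded <- ape::read.tree(tmp_file)

identical(toy_tree$tip.label, tree_reloaded$tip.label)   # same tips, same order
ape::Nnode(tree_reloaded) == ape::Nnode(toy_tree)        # same number of internal nodes
```

The file extension does not change the content: both the .nwk and .txt files contain the same Newick string.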

Rotating nodes in a tree

As mentioned before, we cannot change the order of tips or nodes in the tree, as this is based on their genetic relatedness and is not subject to visual manipulation. But we can rotate branches around nodes if that eases our visualization.

First, we plot our new subset tree 2 with node labels to choose the node we want to manipulate, and store it in a ggtree plot object p.

p <- ggtree(sub_tree2) +  
  geom_tiplab(size = 4) +
  geom_text2(aes(subset=!isTip, label=node), # labels all the nodes in the tree
             size = 5,
             color = "darkred", 
             hjust = 1, 
             vjust = 1) 
p

We can then manipulate nodes by applying ggtree::rotate() or ggtree::flip().

Note: To illustrate which nodes we are manipulating, we first apply the geom_hilight() function from ggtree to highlight the samples in the nodes we are interested in, and store that ggtree plot object in a new object p1.

p1 <- p + geom_hilight(  # highlights node 39 in blue, "extend =" allows us to define the length of the color block
  node = 39,
  fill = "steelblue",
  extend = 0.0017) +  
geom_hilight(            # highlights the node 37 in yellow
  node = 37,
  fill = "yellow",
  extend = 0.0017) +               
ggtitle("Original tree")


p1 # print

Now we can rotate node 37 in object p1 so that the samples on node 38 move to the top. We store the rotated tree in a new object p2.

p2 <- rotate(p1, 37) + 
      ggtitle("Rotated Node 37")


p2   # print

Or we can use the flip command to rotate node 36 in object p1 and switch node 37 to the top and node 39 to the bottom. We store the flipped tree in a new object p3.

p3 <-  flip(p1, 39, 37) +
      ggtitle("Rotated Node 36")


p3   # print

Example subtree with sample data annotation

Let's say we are investigating the cluster of cases with clonal expansion which occurred in 2017 and 2018 at node 39 in our sub-tree. We add the year of strain isolation as well as travel history, and color by country to see the origin of other closely related strains:

ggtree(sub_tree2) %<+% sample_data +     # we use the %<+% operator to link to the sample_data
  geom_tiplab(                          # labels the tips of all branches with the sample name in the tree file
    size = 2.5,
    offset = 0.001,
    align = TRUE) + 
  theme_tree2()+
  xlim(0, 0.015)+                       # set the x-axis limits of our tree
  geom_tippoint(aes(color=Country),     # color the tip point by country
                size = 1.5)+ 
  scale_color_brewer(
    name = "Country", 
    palette = "Set1", 
    na.value = "grey")+
  geom_tiplab(                          # add isolation year as a text label at the tips
    aes(label = Year),
    color = 'blue',
    offset = 0.0045,
    size = 3,
    linetype = "blank" ,
    geom = "text",
    align = TRUE)+ 
  geom_tiplab(                          # add travel history as a text label at the tips, in red color
    aes(label = Travel_history),
    color = 'red',
    offset = 0.006,
    size = 3,
    linetype = "blank",
    geom = "text",
    align = TRUE)+ 
  ggtitle("Phylogenetic tree of Belgian S. sonnei strains with travel history")+  # add plot title
  xlab("genetic distance (0.001 = 4 nucleotides difference)")+                    # add a label to the x-axis 
  theme(
    axis.title.x = element_text(size = 10),
    axis.title.y = element_blank(),
    legend.title = element_text(face = "bold", size = 12),
    legend.text = element_text(face = "bold", size = 10),
    plot.title = element_text(size = 12, face = "bold"))

Our observation points towards an import event of strains from Asia, which then circulated in Belgium over the years and seem to have caused our latest outbreak.

More complex trees: adding heatmaps of sample data

We can add more complex information, such as the categorical presence of antimicrobial resistance genes and numeric values for measured resistance to antimicrobials, in the form of a heatmap using the ggtree::gheatmap() function.

First we need to plot our tree (this can be either linear or circular) and store it in a new ggtree plot object p. We will use the sub_tree2 object created in the subsetting section above.

p <- ggtree(sub_tree2, branch.length='none', layout='circular') %<+% sample_data +
  geom_tiplab(size =3) + 
 theme(
   legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    plot.title = element_text(
      size = 12,
      face = "bold",
      hjust = 0.5,
      vjust = -15))
p

Second, we prepare our data. To visualize different variables with new color schemes, we subset our dataframe to the desired variable. It is important to add the Sample_ID as rownames otherwise it cannot match the data to the tree tip.labels:

In our example we want to look at gender and mutations that could confer resistance to Ciprofloxacin, an important first line antibiotic used to treat Shigella infections.

We create a dataframe for gender:

gender <- data.frame("gender" = sample_data[,c("Gender")])
rownames(gender) <- sample_data$Sample_ID

We create a dataframe for mutations in the gyrA gene, which confer Ciprofloxacin resistance:

cipR <- data.frame("cipR" = sample_data[,c("gyrA_mutations")])
rownames(cipR) <- sample_data$Sample_ID

We create a dataframe for the measured minimum inhibitory concentration (MIC) for Ciprofloxacin from the laboratory:

MIC_Cip <- data.frame("mic_cip" = sample_data[,c("MIC_CIP")])
rownames(MIC_Cip) <- sample_data$Sample_ID
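
The three data frames above follow the same pattern: one column of values, with the sample IDs as row names. As a sketch, the pattern can be wrapped in a small helper. The function make_heatmap_df() and the toy data below are hypothetical and are not part of ggtree or of the handbook data. The helper assumes a base R data frame with unique sample IDs.

```r
# Toy sample data (hypothetical values, for illustration only)
toy_data <- data.frame(
  Sample_ID = c("S1", "S2", "S3"),
  Gender    = c("Male", "Female", NA),
  MIC_CIP   = c(0.03, 0.5, NA))

# Hypothetical helper: one-column data frame with sample IDs as row names
make_heatmap_df <- function(data, value_col, id_col = "Sample_ID") {
  out <- data[, value_col, drop = FALSE]   # drop = FALSE keeps a data frame, not a vector
  rownames(out) <- data[[id_col]]          # row names must be unique
  out
}

gender_toy <- make_heatmap_df(toy_data, "Gender")
mic_toy    <- make_heatmap_df(toy_data, "MIC_CIP")

gender_toy
```

Unlike the code above, the helper keeps the original column name (here "Gender"), so rename the column afterwards if you want a different label.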

We create a first plot adding a binary heatmap for gender to the phylogenetic tree and storing it in a new ggtree plot object h1:

h1 <-  gheatmap(p, gender,                                 # we add a heatmap layer of the gender dataframe to our tree plot
                offset = 10,                               # offset shifts the heatmap to the right,
                width = 0.10,                              # width defines the width of the heatmap column,
                color = NULL,                              # color defines the border of the heatmap columns
         colnames = FALSE) +                               # hides column names for the heatmap
  scale_fill_manual(name = "Gender",                       # define the coloring scheme and legend for gender
                    values = c("#00d1b1", "purple"),
                    breaks = c("Male", "Female"),
                    labels = c("Male", "Female")) +
   theme(legend.position = "bottom",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        legend.box = "vertical", legend.margin = margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h1

Then we add information on mutations in the gyrA gene, which confer resistance to Ciprofloxacin:

Note: The presence of chromosomal point mutations in the WGS data was previously determined using the PointFinder tool developed by Zankari et al. (see the reference in the Resources section).

First, we assign a new color scheme to our existing plot object h1 and store it in a new object h2. This enables us to define and change the colors for our second variable in the heatmap.

h2 <- h1 + new_scale_fill() 

Then we add the second heatmap layer to h2 and store the combined plots in a new object h3:

h3 <- gheatmap(h2, cipR,         # adds the second row of heatmap describing Ciprofloxacin resistance mutations
               offset = 12, 
               width = 0.10, 
               colnames = FALSE) +
  scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
                    values = c("#fe9698","#ea0c92"),
                    breaks = c( "gyrA D87Y", "gyrA S83L"),
                    labels = c( "gyrA d87y", "gyrA s83l")) +
   theme(legend.position = "bottom",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        legend.box = "vertical", legend.margin = margin())+
  guides(fill = guide_legend(nrow = 2,byrow = TRUE))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h3

We repeat the above process, by first adding a new color scale layer to our existing object h3, and then adding the continuous data on the minimum inhibitory concentration (MIC) of Ciprofloxacin for each strain to the resulting object h4 to produce the final object h5:

# First we add the new coloring scheme:
h4 <- h3 + new_scale_fill()

# then we combine the two into a new plot:
h5 <- gheatmap(h4, MIC_Cip,  
               offset = 14, 
               width = 0.10,
                colnames = FALSE)+
  scale_fill_continuous(name = "MIC for Ciprofloxacin",  # here we define a gradient color scheme for the continuous variable of MIC
                      low = "yellow", high = "red",
                      breaks = c(0, 0.50, 1.00),
                      na.value = "white") +
   guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
   theme(legend.position = "bottom",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        legend.box = "vertical", legend.margin = margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h5

We can do the same exercise for a linear tree:

p <- ggtree(sub_tree2) %<+% sample_data +
  geom_tiplab(size = 3) + # labels the tips
  theme_tree2()+
  xlab("genetic distance (0.001 = 4 nucleotides difference)")+
  xlim(0, 0.015)+
 theme(legend.position = "none",
      axis.title.y = element_blank(),
      plot.title = element_text(size = 12, 
                                face = "bold",
                                hjust = 0.5,
                                vjust = -15))
p

First we add gender:

h1 <-  gheatmap(p, gender, 
                offset = 0.003,
                width = 0.1, 
                color="black", 
         colnames = FALSE)+
  scale_fill_manual(name = "Gender",
                    values = c("#00d1b1", "purple"),
                    breaks = c("Male", "Female"),
                    labels = c("Male", "Female"))+
   theme(legend.position = "bottom",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        legend.box = "vertical", legend.margin = margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h1

Then we add Ciprofloxacin resistance mutations after adding another color scheme layer:

h2 <- h1 + new_scale_fill()
h3 <- gheatmap(h2, cipR,   
               offset = 0.004, 
               width = 0.1,
               color = "black",
                colnames = FALSE)+
  scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
                    values = c("#fe9698","#ea0c92"),
                    breaks = c( "gyrA D87Y", "gyrA S83L"),
                    labels = c( "gyrA d87y", "gyrA s83l"))+
   theme(legend.position = "bottom",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        legend.box = "vertical", legend.margin = margin())+
  guides(fill = guide_legend(nrow = 2,byrow = TRUE))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
 h3

Then we add the minimum inhibitory concentration determined by the laboratory (MIC):

h4 <- h3 + new_scale_fill()
h5 <- gheatmap(h4, MIC_Cip, 
               offset = 0.005,  
               width = 0.1,
               color = "black", 
                colnames = FALSE)+
  scale_fill_continuous(name = "MIC for Ciprofloxacin",
                      low = "yellow", high = "red",
                      breaks = c(0,0.50,1.00),
                      na.value = "white")+
   guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
   theme(legend.position = "bottom",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 8),
        legend.box = "horizontal", legend.margin = margin())+
  guides(shape = guide_legend(override.aes = list(size = 2)))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h5

38.5 Resources

http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Ggtree# Clade_Colors
https://bioconductor.riken.jp/packages/3.2/bioc/vignettes/ggtree/inst/doc/treeManipulation.html
https://guangchuangyu.github.io/ggtree-book/chapter-ggtree.html
https://bioconductor.riken.jp/packages/3.8/bioc/vignettes/ggtree/inst/doc/treeManipulation.html

Ea Zankari, Rosa Allesøe, Katrine G Joensen, Lina M Cavaco, Ole Lund, Frank M Aarestrup, PointFinder: a novel web tool for WGS-based detection of antimicrobial resistance associated with chromosomal point mutations in bacterial pathogens, Journal of Antimicrobial Chemotherapy, Volume 72, Issue 10, October 2017, Pages 2764–2768, https://doi.org/10.1093/jac/dkx217

39 Interactive plots

Data visualisation is increasingly required to be interrogable by the audience. Consequently, it is becoming common to create interactive plots. There are several ways to include these, but the two most common are plotly and shiny.

In this page we will focus on converting an existing ggplot() plot into an interactive plot with plotly. You can read more about shiny in the Dashboards with Shiny page. Note that interactive plots are only usable in HTML-format R Markdown documents, not in PDF or Word documents.

Below is a basic epicurve that has been transformed to be interactive using the integration of ggplot2 and plotly (hover your mouse over the plot, zoom in, or click items in the legend).

39.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,       # import/export
  here,      # filepaths
  lubridate, # working with dates
  plotly,    # interactive plots
  scales,    # quick percents
  tidyverse  # data management and visualization
  ) 

Start with a ggplot()

In this page we assume that you are beginning with a ggplot() plot that you want to convert to be interactive. We will build several of these plots in this page, using the case linelist used in many pages of this handbook.

Import data

To begin, we import the cleaned linelist of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import case linelist 
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

39.2 Plot with ggplotly()

The function ggplotly() from the plotly package makes it easy to convert a ggplot() to be interactive. Simply save your ggplot() and then pipe it to the ggplotly() function.

Below, we plot a simple line representing the proportion of cases who died in a given week:

We begin by creating a summary dataset of each epidemiological week, and the percent of cases with a known outcome that died.

weekly_deaths <- linelist %>%
  group_by(epiweek = floor_date(date_onset, "week")) %>%  # create and group data by epiweek column
  summarise(                                              # create new summary data frame:
    n_known_outcome = sum(!is.na(outcome), na.rm=T),      # number of cases per group with known outcome
    n_death  = sum(outcome == "Death", na.rm=T),          # number of cases per group who died
    pct_death = 100*(n_death / n_known_outcome)           # percent of cases with known outcome who died
  )

Here are the first 50 rows of the weekly_deaths dataset.
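
Two details of this summary are worth checking on a small example. First, floor_date() starts weeks on Sunday unless told otherwise, so the sketch below sets week_start = 1 to get Monday-based weeks. Second, a week with no known outcomes gives 0 / 0, which R returns as NaN, so the sketch returns NA for such weeks instead. The toy linelist and object names are ours.

```r
library(dplyr)
library(lubridate)

# Toy linelist (hypothetical values): two cases in one week, one in the next
toy_linelist <- tibble(
  date_onset = as.Date(c("2014-05-05", "2014-05-06", "2014-05-13")),
  outcome    = c("Death", "Recover", NA))

toy_weekly <- toy_linelist %>%
  group_by(epiweek = floor_date(date_onset, "week", week_start = 1)) %>%  # weeks begin on Monday
  summarise(
    n_known_outcome = sum(!is.na(outcome)),                  # cases with known outcome
    n_death         = sum(outcome == "Death", na.rm = TRUE), # cases who died
    pct_death       = if_else(n_known_outcome > 0,           # avoid 0 / 0 = NaN
                              100 * n_death / n_known_outcome,
                              NA_real_))

toy_weekly
```

In the first week one of two cases with a known outcome died, giving 50 percent. The second week has no known outcomes, so its percentage is NA and geom_line() will leave a gap at that week.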

Then we create the plot with ggplot2, using geom_line().

deaths_plot <- ggplot(data = weekly_deaths)+            # begin with weekly deaths data
  geom_line(mapping = aes(x = epiweek, y = pct_death))  # make line 

deaths_plot   # print

We can make this interactive by simply passing this plot to ggplotly(), as below. Hover your mouse over the line to show the x and y values. You can zoom in on the plot, and drag it around. You can also see icons in the upper-right of the plot. In order, they allow you to:

  • Download the current view as a PNG image
  • Zoom in with a select box
  • “Pan”, or move across the plot by clicking and dragging the plot
  • Zoom in, zoom out, or return to default zoom
  • Reset axes to defaults
  • Toggle on/off “spike lines” which are dotted lines from the interactive point extending to the x and y axes
  • Adjustments to whether data show when you are not hovering on the line

deaths_plot %>% plotly::ggplotly()
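
By default the hover box shows the raw x and y values. One common way to customise it, sketched below on made-up weekly data, is to build a label with the text aesthetic in ggplot() and then pass tooltip = "text" to ggplotly(). The data values and object names are ours; group = 1 is included so that the line stays connected once a per-point text label is mapped.

```r
library(ggplot2)
library(plotly)

# Toy weekly summary (hypothetical values)
toy_weekly <- data.frame(
  epiweek   = as.Date(c("2014-05-05", "2014-05-12", "2014-05-19")),
  pct_death = c(50, 40, 62.5))

toy_plot <- ggplot(
  data = toy_weekly,
  mapping = aes(
    x = epiweek,
    y = pct_death,
    group = 1,                                                     # keep one connected line
    text = paste0("Week of ", epiweek, "<br>", pct_death, "% died"))) +  # hover label
  geom_line()

# Show only the custom label when hovering
toy_interactive <- ggplotly(toy_plot, tooltip = "text")

# Print toy_interactive to display the interactive plot
```

The hover label accepts simple HTML such as <br> for line breaks.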

Grouped data work with ggplotly() as well. Below, a weekly epicurve is made, grouped by outcome. The stacked bars are interactive. Try clicking on the different items in the legend (they will appear/disappear).

# Make epidemic curve with incidence2 package
p <- incidence2::incidence(
  linelist,
  date_index = date_onset,
  interval = "weeks",
  groups = outcome) %>% plot(fill = outcome)
# Plot interactively  
p %>% plotly::ggplotly()

39.3 Modifications

File size

When exporting to an HTML document generated with R Markdown (like this book!), you want the plot to take up as little data as possible (in most cases there are no negative side effects). To do this, pipe the interactive plot to partial_bundle(), also from plotly.

p <- p %>% 
  plotly::ggplotly() %>%
  plotly::partial_bundle()

Buttons

Some of the buttons on a standard plotly are superfluous and can be distracting, so you can remove them. You can do this simply by piping the output into config() from plotly and specifying which buttons to remove. In the below example we specify in advance the names of the buttons to remove, and provide them to the argument modeBarButtonsToRemove =. We also set displaylogo = FALSE to remove the plotly logo.

## these buttons are distracting and we want to remove them
plotly_buttons_remove <- list('zoom2d','pan2d','lasso2d', 'select2d','zoomIn2d',
                              'zoomOut2d','autoScale2d','hoverClosestCartesian',
                              'toggleSpikelines','hoverCompareCartesian')

p <- p %>%          # re-define interactive plot without these buttons
  plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

39.4 Heat tiles

You can make almost any ggplot() plot interactive, including heat tiles. In the page on Heat plots you can read about how to make the below plot, which displays the proportion of days per week that certain facilities reported data to their province.

Here is the code, although we will not describe it in depth here.

# import data
facility_count_data <- rio::import(here::here("data", "malaria_facility_count_data.rds"))

# aggregate data into Weeks for Spring district
agg_weeks <- facility_count_data %>% 
  filter(District == "Spring",
         data_date < as.Date("2020-08-01")) %>% 
  mutate(week = aweek::date2week(
    data_date,
    start_date = "Monday",
    floor_day = TRUE,
    factor = TRUE)) %>% 
  group_by(location_name, week, .drop = F) %>%
  summarise(
    n_days          = 7,
    n_reports       = n(),
    malaria_tot     = sum(malaria_tot, na.rm = T),
    n_days_reported = length(unique(data_date)),
    p_days_reported = round(100*(n_days_reported / n_days))) %>% 
  right_join(tidyr::expand(., week, location_name)) %>% 
  mutate(week = aweek::week2date(week))

# create plot
metrics_plot <- ggplot(agg_weeks,
       aes(x = week,
           y = location_name,
           fill = p_days_reported))+
  geom_tile(colour="white")+
  scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
  scale_x_date(expand = c(0,0),
               date_breaks = "2 weeks",
               date_labels = "%d\n%b")+
  theme_minimal()+ 
  theme(
    legend.title = element_text(size=12, face="bold"),
    legend.text  = element_text(size=10, face="bold"),
    legend.key.height = grid::unit(1,"cm"),
    legend.key.width  = grid::unit(0.6,"cm"),
    axis.text.x = element_text(size=12),
    axis.text.y = element_text(vjust=0.2),
    axis.ticks = element_line(size=0.4),
    axis.title = element_text(size=12, face="bold"),
    plot.title = element_text(hjust=0,size=14,face="bold"),
    plot.caption = element_text(hjust = 0, face = "italic")
    )+
  labs(x = "Week",
       y = "Facility name",
       fill = "Reporting\nperformance (%)",
       title = "Percent of days per week that facility reported data",
       subtitle = "District health facilities, April-May 2019",
       caption = "7-day weeks beginning on Mondays.")

metrics_plot # print

Below, we make it interactive and modify it for simple buttons and file size.

metrics_plot %>% 
  plotly::ggplotly() %>% 
  plotly::partial_bundle() %>% 
  plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)

39.5 Resources

Plotly is not just for R; it also works well with Python (and really any data science language, as it is built in JavaScript). You can read more about it on the plotly website.

Reports and dashboards

40 Reports with R Markdown

R Markdown is a widely-used tool for creating automated, reproducible, and share-worthy outputs, such as reports. It can generate static or interactive outputs, in Word, PDF, HTML, PowerPoint, and other formats.

An R Markdown script intersperses R code and text such that the script actually becomes your output document. You can create an entire formatted document, including narrative text (which can be dynamic and change based on your data), tables, figures, bullets/numbers, bibliographies, etc.

Such documents can be produced to update on a routine basis (e.g. daily surveillance reports) and/or run on subsets of data (e.g. reports for each jurisdiction).

Other pages in this handbook expand on this topic, including the page on Organizing routine reports.

Of note, the R4Epis project has developed template R Markdown scripts for common outbreak and survey scenarios encountered at MSF project locations.

40.1 Preparation

Background to R Markdown

To explain some of the concepts and packages involved:

  • Markdown is a “language” that allows you to write a document using plain text, that can be converted to html and other formats. It is not specific to R. Files written in Markdown have a ‘.md’ extension.
  • R Markdown is a variation on markdown that is specific to R. It allows you to write a document using markdown to produce text, and to embed R code and display its outputs. R Markdown files have a ‘.Rmd’ extension.
  • rmarkdown (the package): This is used by R to render the .Rmd file into the desired output. Its focus is converting the markdown (text) syntax, so we also need…
  • knitr: This R package will read the code chunks, execute them, and ‘knit’ the results back into the document. This is how tables and graphs are included alongside the text.
  • Pandoc: Finally, pandoc actually converts the output into Word/PDF/PowerPoint etc. It is software separate from R but is installed automatically with RStudio.

In sum, the process that happens in the background (you do not need to know all these steps!) involves feeding the .Rmd file to knitr, which executes the R code chunks and creates a new .md (markdown) file which includes the R code and its rendered output. The .md file is then processed by pandoc to create the finished product: a Microsoft Word document, HTML file, powerpoint document, pdf, etc.

(source: https://rmarkdown.rstudio.com/authoring_quick_tour.html)
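
To make the first half of that pipeline concrete, the sketch below writes a tiny, hypothetical .Rmd file to a temporary folder and runs only the knitr step, which produces the intermediate .md file. The file names and chunk contents are illustrative assumptions. In normal use, the Knit button or rmarkdown::render() runs both knitr and pandoc for you.

```r
# A minimal sketch of the knitr step only (pandoc is not involved).
# File names and contents below are hypothetical.
fence <- strrep("`", 3)                      # three back-ticks, built in code

rmd_lines <- c(
  "# Demo",
  "",
  paste0(fence, "{r}"),                      # opens an R code chunk
  "1 + 1",
  fence,                                     # closes the chunk
  "",
  "Some narrative text."
)

rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- sub("\\.Rmd$", ".md", rmd_file)
writeLines(rmd_lines, rmd_file)

# knitr executes the chunk and writes a plain markdown file
knitr::knit(input = rmd_file, output = md_file, quiet = TRUE)

md_lines <- readLines(md_file)
print(md_lines)   # the .md holds the narrative text, the code, and its printed result
```

The .md file is what pandoc would then convert into the final Word, HTML, or PDF document.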

Installation

To create an R Markdown output, you need to have the following installed:

  • The rmarkdown package (knitr will also be installed automatically)
  • Pandoc, which should come installed with RStudio. If you are not using RStudio, you can download Pandoc here: http://pandoc.org.
  • If you want to generate PDF output (a bit trickier), you will need to install LaTeX. For R Markdown users who have not installed LaTeX before, we recommend that you install TinyTeX (https://yihui.name/tinytex/). You can use the following commands:
pacman::p_load(tinytex)     # install tinytex package
tinytex::install_tinytex()  # R command to install TinyTeX software

40.2 Getting started

Install rmarkdown R package

Install the rmarkdown R package. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(rmarkdown)

Starting a new Rmd file

In RStudio, open a new R markdown file, starting with ‘File’, then ‘New file’ then ‘R markdown…’.

RStudio will give you some output options to pick from. In the example below we select “HTML” because we want to create an HTML document. The title and the author names are not important. If the output document type you want is not one of these, don’t worry - you can just pick any one and change it in the script later.

This will open up a new .Rmd script.

Important to know

The working directory

The working directory of a markdown file is wherever the Rmd file itself is saved. For instance, if the R project is within ~/Documents/projectX and the Rmd file itself is in a subfolder ~/Documents/projectX/markdownfiles/markdown.Rmd, the code read.csv("data.csv") within the markdown will look for a csv file in the markdownfiles folder, and not the root project folder where scripts within projects would normally automatically look.

To refer to files elsewhere, you will either need to use the full file path or use the here package. The here package sets the working directory to the root folder of the R project and is explained in detail in the R projects and Import and export pages of this handbook. For instance, to import a file called “data.csv” from within the projectX folder, the code would be import(here("data.csv")).

Note that use of setwd() in R Markdown scripts is not recommended – it only applies to the code chunk that it is written in.
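
As a hedged illustration of the here approach, the sketch below builds a file path relative to the project root. The subfolder name data and the file name data.csv are hypothetical, and the import step is commented out because the file is assumed rather than provided.

```r
# Build a file path relative to the project root rather than the Rmd location.
# "data" and "data.csv" are hypothetical names used for illustration.
library(here)

data_path <- here("data", "data.csv")   # <project root>/data/data.csv
print(data_path)

# If the file exists, it could then be imported, for example:
# linelist <- rio::import(data_path)
```

Because the path is built from the project root, the same code works whether it is run from an R script in the root folder or from an Rmd saved in a subfolder.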

Working on a drive vs your computer

Because R Markdown can run into pandoc issues when running on a shared network drive, it is recommended that your folder is on your local machine, e.g. in a project within ‘My Documents’. If you use Git (much recommended!), this will be familiar. For more details, see the handbook pages on R on network drives and Errors and help.

40.3 R Markdown components

An R Markdown document can be edited in RStudio just like a standard R script. When you start a new R Markdown script, RStudio tries to be helpful by showing a template which explains the different sections of an R Markdown script.

The below is what appears when starting a new Rmd script intended to produce an html output (as per previous section).

As you can see, there are three basic components to an Rmd file: YAML, Markdown text, and R code chunks.

These will create and become your document output. See the diagram below:

YAML metadata

Referred to as the ‘YAML metadata’ or just ‘YAML’, this is at the top of the R Markdown document. This section of the script will tell your Rmd file what type of output to produce, formatting preferences, and other metadata such as document title, author, and date. There are other uses not mentioned here (but referred to in ‘Producing an output’). Note that indentation matters; tabs are not accepted but spaces are.

This section must begin with a line containing just three dashes --- and must close with a line containing just three dashes ---. YAML parameters come in key:value pairs. The placement of colons in YAML is important - the key:value pairs are separated by colons (not equals signs!).

The YAML should begin with metadata for the document. The order of these primary YAML parameters (not indented) does not matter. For example:

title: "My document"
author: "Me"
date: "2021-08-31"

You can use R code in YAML values by writing it as in-line code (preceded by r within back-ticks) and also wrapping it in quotes. For example, date: "`r Sys.Date()`" would insert the date of rendering instead of the fixed date shown above.

In the image above, because we clicked that our default output would be an html file, we can see that the YAML says output: html_document. However we can also change this to say powerpoint_presentation or word_document or even pdf_document.
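
To check how R reads a YAML header, one option is rmarkdown::yaml_front_matter(), which returns the header as an R list. The sketch below writes a small, hypothetical header to a temporary file; the title, author, and output values are placeholders.

```r
# Write a hypothetical Rmd with a YAML header, then read the header back as a list
yaml_rmd <- tempfile(fileext = ".Rmd")

writeLines(c(
  "---",
  "title: \"My document\"",
  "author: \"Me\"",
  "date: \"`r Sys.Date()`\"",     # in-line R code, wrapped in quotes
  "output: word_document",        # could also be html_document, pdf_document, ...
  "---",
  "",
  "Report text goes here."
), yaml_rmd)

header <- rmarkdown::yaml_front_matter(yaml_rmd)
str(header)   # each key:value pair becomes a named list element
```

Reading the header this way does not render anything; it is only a quick check that the keys, colons, and indentation are being parsed as you expect.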

Text

This is the narrative of your document, including the titles and headings. It is written in the “markdown” language, which is used across many different software applications.

Below are the core ways to write this text. More extensive documentation is available in the R Markdown “cheatsheet” at the RStudio website.

New lines

In R Markdown, to initiate a new line, enter two spaces at the end of the previous line and then Enter/Return.

Case

Surround your normal text with these characters to change how it appears in the output.

  • Underscores (_text_) or single asterisks (*text*) to italicise
  • Double asterisks (**text**) for bold text
  • Back-ticks (`text`) to display text as code

The actual appearance of the font can be set by using specific templates (specified in the YAML metadata; see example tabs).

Color

There is no simple mechanism to change the color of text in R Markdown. One work-around, IF your output is an HTML file, is to add an HTML line into the markdown text. The below HTML code will print a line of text in bold red.

<span style="color: red;">**_DANGER:_** This is a warning.</span>  

DANGER: This is a warning.

Titles and headings

A hash symbol in a text portion of an R Markdown script creates a heading. This is different from a chunk of R code in the script, in which a hash symbol is a mechanism to comment/annotate/de-activate, as in a normal R script.

Different heading levels are established with different numbers of hash symbols at the start of a new line. One hash symbol is a title or primary heading. Two hash symbols are a second-level heading. Third- and fourth-level headings can be made with successively more hash symbols.

# First-level heading / title

## Second level heading  

### Third-level heading

Bullets and numbering

Use asterisks (*) to create a bulleted list. Finish the previous sentence, enter two spaces, Enter/Return twice, and then start your bullets. Include a space between the asterisk and your bullet text. After each bullet enter two spaces and then Enter/Return. Sub-bullets work the same way but are indented. Numbers work the same way but instead of an asterisk, write 1), 2), etc. Below is how your R Markdown script text might look.

Here are my bullets (there are two spaces after this colon):  

* Bullet 1 (followed by two spaces and Enter/Return)  
* Bullet 2 (followed by two spaces and Enter/Return)  
  * Sub-bullet 1 (followed by two spaces and Enter/Return)  
  * Sub-bullet 2 (followed by two spaces and Enter/Return)  
  

Comment out text

You can “comment out” R Markdown text just as you can use the “#” to comment out a line of R code in an R chunk. Simply highlight the text and press Ctrl+Shift+c (Cmd+Shift+c for Mac). The text will be surrounded by arrows (<!-- and -->) and turn green. It will not appear in your output.

Code chunks

Sections of the script that are dedicated to running R code are called “chunks”. This is where you may load packages, import data, and perform the actual data management and visualisation. There may be many code chunks, so they can help you organize your R code into parts, perhaps interspersed with text. To note: These ‘chunks’ will appear to have a slightly different background colour from the narrative part of the document.

Each chunk is opened with a line that starts with three back-ticks, and curly brackets that contain parameters for the chunk ({ }). The chunk ends with three more back-ticks.

You can create a new chunk by typing it out yourself, by using the keyboard shortcut “Ctrl + Alt + i” (or Cmd + Option + i on Mac), or by clicking the green ‘insert a new code chunk’ icon at the top of your script editor.

Some notes about the contents of the curly brackets { }:

  • They start with ‘r’ to indicate that the language within the chunk is R
  • After the r you can optionally write a chunk “name”; these are not necessary but can help you organise your work. Note that if you name your chunks, you should ALWAYS use unique names or else R will complain when you try to render.
  • The curly brackets can include other options too, written as tag=value, such as:
    • eval = FALSE to not run the R code
    • echo = FALSE to not print the chunk’s R source code in the output document
    • warning = FALSE to not print warnings produced by the R code
    • message = FALSE to not print any messages produced by the R code
    • include = TRUE or FALSE, to set whether the chunk outputs (e.g. plots) are included in the document
    • out.width = and out.height = to set the output size, provided in the style out.width = "75%"
    • fig.align = "center" to adjust how a figure is aligned across the page
    • fig.show = 'hold' if your chunk prints multiple figures and you want them printed next to each other (pair with out.width = c("33%", "67%")). This can also be set as fig.show = 'asis' to show them below the code that generates them, 'hide' to hide them, or 'animate' to concatenate multiple figures into an animation.
  • A chunk header must be written in one line
  • In chunk names, try to avoid periods, underscores, and spaces. Use hyphens ( - ) instead if you need a separator.

Read more extensively about the knitr options here.

Some of the above options can be configured with point-and-click using the setting buttons at the top right of the chunk. Here, you can specify which parts of the chunk you want the rendered document to include, namely the code, the outputs, and the warnings. This will come out as written preferences within the curly brackets, e.g. echo=FALSE if you specify you want to ‘Show output only’.

There are also two arrows at the top right of each chunk, which are useful to run code within a chunk, or all code in prior chunks. Hover over them to see what they do.

For global options to be applied to all chunks in the script, you can set this up within your very first R code chunk in the script. For instance, so that only the outputs are shown for each code chunk and not the code itself, you can include this command in the R code chunk:

knitr::opts_chunk$set(echo = FALSE) 
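
Several global defaults can be set in one call. The sketch below is one possible setup chunk, with option values chosen purely for illustration; any individual chunk can still override these defaults within its own curly brackets.

```r
# Contents of a possible first ("setup") chunk: defaults for all later chunks
knitr::opts_chunk$set(
  echo      = FALSE,      # hide R source code
  warning   = FALSE,      # hide warnings
  message   = FALSE,      # hide messages
  fig.align = "center",   # centre figures
  out.width = "75%"       # figure width relative to the page
)

# Inspect some of the defaults that are now in effect
knitr::opts_chunk$get(c("echo", "warning", "message"))
```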

In-text R code

You can also include minimal R code within back-ticks. Within the back-ticks, begin the code with “r” and a space, so RStudio knows to evaluate the code as R code. See the example below.

The example below shows multiple heading levels, bullets, and uses R code for the current date (Sys.Date()) to evaluate into a printed date.

The example above is simple (showing the current date), but using the same syntax you can display values produced by more complex R code (e.g. to calculate the min, median, max of a column). You can also integrate R objects or values that were created in R code chunks earlier in the script.

As an example, the script below calculates the proportion of cases that are aged less than 18 years old, using tidyverse functions, and creates the objects less18, total, and less18prop. This dynamic value is inserted into subsequent text. We see how it looks when knitted to a word document.
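
A minimal sketch of this pattern follows. It uses a small, made-up vector of ages in place of the linelist, and base R in place of tidyverse functions. knitr::knit(text = ) is used here only so that the result can be seen in the console; in a real Rmd you would type the sentence with its in-line code directly into the text.

```r
# Hypothetical ages standing in for the linelist's age column
ages <- c(5, 12, 17, 25, 40, 63, 9, 30)

less18     <- sum(ages < 18)                 # number of cases aged under 18
total      <- length(ages)                   # number of all cases
less18prop <- round(100 * less18 / total)    # percentage, rounded

# A line of R Markdown text with in-line R code (back-tick, r, space, code)
rmd_text <- "Of `r total` cases, `r less18` (`r less18prop`%) were under 18."

# Knit just this text; each piece of in-line code is replaced by its value
knitted <- knitr::knit(text = rmd_text, quiet = TRUE)
cat(knitted)
```

If the data change, the numbers in the sentence update the next time the document is rendered, with no manual editing of the text.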

Images

You can include images in your R Markdown one of two ways:

![](path/to/image.png)  

If the above does not work, try using knitr::include_graphics()

knitr::include_graphics("path/to/image.png")

(remember, your file path could be written using the here package)

knitr::include_graphics(here::here("path", "to", "image.png"))

Tables

Create a table using hyphens ( - ) and bars ( | ). The number of hyphens before/between bars determines how many characters fit in the cell before the text begins to wrap.

Column 1 |Column  2 |Column 3
---------|----------|--------
Cell A   |Cell B    |Cell C
Cell D   |Cell E    |Cell F

The above code produces the table below:

Column 1   Column 2   Column 3
Cell A     Cell B     Cell C
Cell D     Cell E     Cell F
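
If the table contents already exist as a data frame, you do not have to type the hyphens and bars by hand. As one option (not the only one), knitr::kable() can write the same kind of markdown for you. The data frame below is a hypothetical stand-in that mirrors the example above.

```r
# Hypothetical data frame mirroring the hand-written table
cells <- data.frame(
  "Column 1" = c("Cell A", "Cell D"),
  "Column 2" = c("Cell B", "Cell E"),
  "Column 3" = c("Cell C", "Cell F"),
  check.names = FALSE          # keep the spaces in the column names
)

# format = "pipe" produces a table written with bars and hyphens
md_table <- knitr::kable(cells, format = "pipe")
md_table
```

When this is run inside a code chunk, the table is rendered in the output document in place of the chunk.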

Tabbed sections

For HTML outputs, you can arrange the sections into “tabs”. Simply add .tabset in the curly brackets { } that are placed after a heading. Any sub-headings beneath that heading (until another heading of the same level) will appear as tabs that the user can click through. Read more here

You can add an additional option .tabset-pills after .tabset to give the tabs themselves a “pilled” appearance. Be aware that when viewing the tabbed HTML output, the Ctrl+f search functionality will only search “active” tabs, not hidden tabs.

40.4 File structure

There are several ways to structure your R Markdown and any associated R scripts. Each has advantages and disadvantages:

  • Self-contained R Markdown - everything needed for the report is imported or created within the R Markdown
    • Source other files - You can run external R scripts with the source() command and use their outputs in the Rmd
    • Child scripts - an alternate mechanism for source()
  • Utilize a “runfile” - Run commands in an R script prior to rendering the R Markdown

Self-contained Rmd

For a relatively simple report, you may elect to organize your R Markdown script such that it is “self-contained” and does not involve any external scripts.

Everything you need to run the R markdown is imported or created within the Rmd file, including all the code chunks and package loading. This “self-contained” approach is appropriate when you do not need to do much data processing (e.g. it brings in a clean or semi-clean data file) and the rendering of the R Markdown will not take too long.

In this scenario, one logical organization of the R Markdown script might be:

  1. Set global knitr options
  2. Load packages
  3. Import data
  4. Process data
  5. Produce outputs (tables, plots, etc.)
  6. Save outputs, if applicable (.csv, .png, etc.)

Source other files

One variation of the “self-contained” approach is to have R Markdown code chunks “source” (run) other R scripts. This can make your R Markdown script less cluttered, simpler, and easier to organize. It can also help if you want to display final figures at the beginning of the report. In this approach, the final R Markdown script simply combines pre-processed outputs into a document.

One way to do this is by providing the R scripts (file path and name with extension) to the base R command source().

source("your-script.R", local = knitr::knit_global())
# or sys.source("your-script.R", envir = knitr::knit_global())

Note that when using source() within the R Markdown, the external files will still be run during the course of rendering your Rmd file. Therefore, each script is run every time you render the report. Thus, having these source() commands within the R Markdown does not speed up your run time, nor does it greatly assist with de-bugging, as any errors produced will still be printed when producing the R Markdown.

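An alternative is to utilize the child = knitr chunk option, which takes the path to another R Markdown file and knits its content into the main document at the location of that chunk.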

You must be aware of various R environments. Objects created within an environment will not necessarily be available to the environment used by the R Markdown.
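
The sketch below illustrates the point about environments using a throwaway script written to a temporary file. The script contents and object names are hypothetical. Sourcing into a dedicated environment keeps the created object there, so code that looks in a different environment will not find it.

```r
# Write a tiny, hypothetical helper script to a temporary file
script_file <- tempfile(fileext = ".R")
writeLines("case_total <- 42", script_file)

# Source it into its own environment instead of the global environment
script_env <- new.env()
source(script_file, local = script_env)

exists("case_total", envir = script_env, inherits = FALSE)    # the object lives here
exists("case_total", envir = globalenv(), inherits = FALSE)   # source() did not put it here
get("case_total", envir = script_env)                         # retrieve it explicitly
```

Inside an Rmd, passing local = knitr::knit_global() to source(), as in the command above, places the sourced objects in the environment that the code chunks use.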

Runfile

This approach involves utilizing the R script that contains the render() command(s) to pre-process objects that feed into the R markdown.

For instance, you can load the packages, load and clean the data, and even create the graphs of interest prior to render(). These steps can occur in the R script, or in other scripts that are sourced. As long as these commands occur in the same RStudio session and objects are saved to the environment, the objects can then be called within the Rmd content. Then the R markdown itself will only be used for the final step - to produce the output with all the pre-processed objects. This is much easier to de-bug if something goes wrong.

This approach is helpful for the following reasons:

  • More informative error messages - these messages will be generated from the R script, not the R Markdown. R Markdown errors tend to tell you which chunk had a problem, but will not tell you which line.
  • If applicable, you can run long processing steps in advance of the render() command - they will run only once.

In the example below, we have a separate R script in which we pre-process a data object into the R Environment and then render the “create_output.Rmd” using render().

pacman::p_load(rio, dplyr)               # packages for import(), %>% and select()

data <- import("datafile.csv") %>%       # Load data and save to environment
  select(age, hospital, weight)          # Select limited columns

rmarkdown::render(input = "create_output.Rmd")   # Render the Rmd into the output document

Folder structure

Workflow also concerns the overall folder structure, such as having an ‘output’ folder for created documents and figures, and ‘data’ or ‘inputs’ folders for cleaned data. We do not go into further detail here, but check out the Organizing routine reports page.

40.5 Producing the document

You can produce the document in the following ways:

  • Manually by pressing the “Knit” button at the top of the RStudio script editor (fast and easy)
  • Run the render() command (executed outside the R Markdown script)

Option 1: “Knit” button

When you have the Rmd file open, press the ‘Knit’ icon/button at the top of the file.

RStudio will show you the progress within an ‘R Markdown’ tab near your R console. The document will automatically open when complete.

The document will be saved in the same folder as your R Markdown script, and with the same file name (aside from the extension). This is not ideal for version control, because the file is over-written each time you knit unless you move or rename it yourself (e.g. add a date).

This is RStudio’s shortcut button for the render() function from rmarkdown. This approach is only compatible with a self-contained R Markdown, where all the needed components exist or are sourced within the file.

Option 2: render() command

Another way to produce your R Markdown output is to run the render() function (from the rmarkdown package). You must execute this command outside the R Markdown script - so either in a separate R script (often called a “run file”), or as a stand-alone command in the R Console.

rmarkdown::render(input = "my_report.Rmd")

As with “knit”, the default settings will save the Rmd output to the same folder as the Rmd script, with the same file name (aside from the file extension). For instance “my_report.Rmd” when knitted will create “my_report.docx” if you are knitting to a word document. However, by using render() you have the option to use different settings. render() can accept arguments including:

  • output_format = This is the output format to convert to (e.g. "html_document", "pdf_document", "word_document", or "all"). You can also specify this in the YAML inside the R Markdown script.
  • output_file = This is the name of the output file (and file path). This can be created via R functions like here() or str_glue() as demonstrated below.
  • output_dir = This is an output directory (folder) to save the file. This allows you to choose an alternative to the directory the Rmd file is saved in.
  • output_options = You can provide a list of options that will override those in the script YAML
  • output_yaml = You can provide path to a .yml file that contains YAML specifications
  • params = See the section on parameters below
  • See the complete list here

As one example, to improve version control, the following command will save the output file within an ‘outputs’ sub-folder, with the current date in the file name. To create the file name, the function str_glue() from the stringr package is used to ‘glue’ together static strings (written plainly) with dynamic R code (written in curly brackets). For instance, if it is April 10th 2021, the file name from below will be “Report_2021-04-10.docx”. See the page on Characters and strings for more details on str_glue().

rmarkdown::render(
  input = "create_output.Rmd",
  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx")) 

As the file renders, the RStudio Console will show you the rendering progress up to 100%, and a final message to indicate that the rendering is complete.
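
Several of the arguments listed above can be combined. The sketch below renders a small, hypothetical Rmd (written to a temporary folder) to HTML, saving it in a separate output folder under a dated file name. The file names and folder are assumptions for illustration, base R paste0() stands in for str_glue(), and the render step is skipped if pandoc cannot be found.

```r
# Hypothetical short report saved in a temporary folder
report_rmd <- file.path(tempdir(), "my_report.Rmd")
writeLines(c("# Daily report", "", "Report text goes here."), report_rmd)

out_dir  <- file.path(tempdir(), "outputs")            # alternative output folder
out_name <- paste0("Report_", Sys.Date(), ".html")     # dated file name
dir.create(out_dir, showWarnings = FALSE)

if (rmarkdown::pandoc_available()) {
  out_path <- rmarkdown::render(
    input         = report_rmd,
    output_format = "html_document",   # overrides any format set in the YAML
    output_file   = out_name,
    output_dir    = out_dir,
    quiet         = TRUE               # suppress the progress messages
  )
  print(out_path)                      # render() returns the path of the output file
}
```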

Option 3: reportfactory package

The R package reportfactory offers an alternative method of organising and compiling R Markdown reports catered to scenarios where you run reports routinely (e.g. daily, weekly…). It eases the compilation of multiple R Markdown files and the organization of their outputs. In essence, it provides a “factory” from which you can run the R Markdown reports, get automatically date- and time-stamped folders for the outputs, and have “light” version control.

Read more about this work flow in the page on Organizing routine reports.

40.6 Parameterised reports

You can use parameterisation to make a report dynamic, such that it can be run with specific settings (e.g. a specific date or place or with certain knitting options). Below, we focus on the basics, but there is more detail online about parameterized reports.

Using the Ebola linelist as an example, let’s say we want to run a standard surveillance report for each hospital each day. We show how one can do this using parameters.

Important: dynamic reports are also possible without the formal parameter structure (without params:), using simple R objects in an adjacent R script. This is explained at the end of this section.

Setting parameters

You have several options for specifying parameter values for your R Markdown output.

Option 1: Set parameters within YAML

Edit the YAML to include a params: option, with indented statements for each parameter you want to define. In this example we create parameters date and hospital, for which we specify values. These values are subject to change each time the report is run. If you use the “Knit” button to produce the output, the parameters will have these default values. Likewise, if you use render() the parameters will have these default values unless otherwise specified in the render() command.

---
title: Surveillance report
output: html_document
params:
 date: 2021-04-10
 hospital: Central Hospital
---

In the background, these parameter values are contained within a read-only list called params. Thus, you can insert the parameter values in R code as you would another R object/value in your environment. Simply type params$ followed by the parameter name. For example params$hospital to represent the hospital name (“Central Hospital” by default).

Note that parameters can also hold the values true or false, and so these can be included in your knitr options for an R chunk. For example, you can set {r, eval=params$run} instead of {r, eval=FALSE}, and now whether the chunk runs or not depends on the value of a parameter run:.

Note that for parameters that are dates, they will be input as a string. So for params$date to be interpreted in R code it will likely need to be wrapped with as.Date() or a similar function to convert to class Date.
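
A minimal sketch of using these values inside a chunk follows. When knitting, the params list is created for you from the YAML; here it is built by hand so that the code can be run on its own, and the small linelist is made-up data with hypothetical column names.

```r
# Stand-in for the list that R Markdown creates from the YAML params: section
params <- list(date = "2021-04-10", hospital = "Central Hospital")

# Made-up linelist with hypothetical column names
linelist <- data.frame(
  hospital   = c("Central Hospital", "Port Hospital", "Central Hospital"),
  date_onset = as.Date(c("2021-04-01", "2021-04-05", "2021-04-12"))
)

report_date <- as.Date(params$date)    # convert the string to class Date

# Keep this hospital's cases with onset on or before the report date
report_data <- linelist[linelist$hospital == params$hospital &
                          linelist$date_onset <= report_date, ]

# A title that changes with the parameters
plot_title <- paste("Cases at", params$hospital, "as of", report_date)
```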

Option 2: Set parameters within render()

As mentioned above, an alternative to pressing the “Knit” button to produce the output is to execute the render() function from a separate script. In this latter case, you can specify the parameters to be used in that rendering to the params = argument of render().

Note that any parameter values provided here will overwrite their default values if written within the YAML. We write the values in quotation marks as in this case they should be defined as character/string values.

The below command renders “surveillance_report.Rmd”, specifies a dynamic output file name and folder, and provides a list() of two parameters and their values to the argument params =.

rmarkdown::render(
  input = "surveillance_report.Rmd",  
  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx"),
  params = list(date = "2021-04-10", hospital  = "Central Hospital"))

Option 3: Set parameters using a Graphical User Interface

For a more interactive feel, you can also use the Graphical User Interface (GUI) to manually select values for parameters. To do this we can click the drop-down menu next to the ‘Knit’ button and choose ‘Knit with parameters’.

A pop-up will appear allowing you to type in values for the parameters that are established in the document’s YAML.

You can achieve the same through a render() command by specifying params = "ask", as demonstrated below.

rmarkdown::render(
  input = "surveillance_report.Rmd",  
  output_file = stringr::str_glue("outputs/Report_{Sys.Date()}.docx"),
  params = "ask")

However, typing values into this pop-up window is subject to error and spelling mistakes. You may prefer to add restrictions to the values that can be entered through drop-down menus. You can do this by adding in the YAML several specifications for each params: entry.

  • label: is the title for that particular drop-down menu
  • value: is the default (starting) value
  • input: set to select for drop-down menu
  • choices: Give the eligible values in the drop-down menu

Below, these specifications are written for the hospital parameter.

---
title: Surveillance report
output: html_document
params:
 date: 2021-04-10
 hospital: 
  label: "Town:"
  value: Central Hospital
  input: select
  choices: [Central Hospital, Military Hospital, Port Hospital, St. Mark's Maternity Hospital (SMMH)]
---

When knitting (either via the ‘knit with parameters’ button or by render()), the pop-up window will have drop-down options to select from.

Parameterized example

The following code creates parameters for date and hospital, which are used in the R Markdown as params$date and params$hospital, respectively.

In the resulting report output, see how the data are filtered to the specific hospital, and the plot title refers to the correct hospital and date. We use the “linelist_cleaned.rds” file here, but it would be particularly appropriate if the linelist itself also had a datestamp within it to align with parameterised date.

Knitting this produces the final output with the default font and layout.

Parameterisation without params

If you are rendering an R Markdown file with render() from a separate script, you can actually achieve the effect of parameterization without using the params: functionality.

For instance, in the R script that contains the render() command, you can simply define hospital and date as two R objects (values) before the render() command. In the R Markdown, you would not need to have a params: section in the YAML, and we would refer to the date object rather than params$date and hospital rather than params$hospital.

# This is an R script that is separate from the R Markdown

# define R objects
hospital <- "Central Hospital"
date <- "2021-04-10"

# Render the R markdown
rmarkdown::render(input = "create_output.Rmd") 

Following this approach means you cannot “knit with parameters”, use the GUI, or include knitting options within the parameters. However, it allows for simpler code, which may be advantageous.

40.7 Looping reports

We may want to run a report multiple times, varying the input parameters, to produce a report for each jurisdiction/unit. This can be done using tools for iteration, which are explained in detail in the page on Iteration, loops, and lists. Options include the purrr package, or use of a for loop as explained below.

Below, we use a simple for loop to generate a surveillance report for all hospitals of interest. This is done with one command (instead of manually changing the hospital parameter one-at-a-time). The command to render the reports must exist in a separate script outside the report Rmd. This script will also contain defined objects to “loop through” - today’s date, and a vector of hospital names to loop through.

hospitals <- c("Central Hospital",
                "Military Hospital", 
                "Port Hospital",
                "St. Mark's Maternity Hospital (SMMH)") 

We then feed these values one-at-a-time into the render() command using a loop, which runs the command once for each value in the hospitals vector. The letter i represents the index position (1 through 4) of the hospital currently being used in that iteration, such that hospitals[1] would be “Central Hospital”. This information is supplied in two places in the render() command:

  1. To the file name, such that the file name of the first iteration if produced on 10th April 2021 would be “Report_Central Hospital_2021-04-10.docx”, saved in the ‘output’ subfolder of the working directory.
  2. To params = such that the Rmd uses the hospital name internally whenever the params$hospital value is called (e.g. to filter the dataset to the particular hospital only). In this example, four files would be created - one for each hospital.

# Render one report per hospital (str_glue() comes from the stringr package)
for (i in seq_along(hospitals)) {
  rmarkdown::render(
    input = "surveillance_report.Rmd",
    output_file = stringr::str_glue("output/Report_{hospitals[i]}_{Sys.Date()}.docx"),
    params = list(hospital = hospitals[i]))
}
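As an alternative to the for loop, the same iteration could be written with purrr. The sketch below is one possible way to do it, not the only one: it builds the output file names with base R and then uses purrr::walk2(), which is intended for functions called for their side effects. The function name render_all_hospitals() is hypothetical, and the rendering is wrapped in a function so that nothing runs until you call it.

```r
hospitals <- c("Central Hospital",
               "Military Hospital",
               "Port Hospital",
               "St. Mark's Maternity Hospital (SMMH)")

# One output path per hospital, in the "output" subfolder
output_files <- file.path(
  "output",
  paste0("Report_", hospitals, "_", Sys.Date(), ".docx"))

# Hypothetical helper: renders the Rmd once per hospital / file name pair
render_all_hospitals <- function(rmd = "surveillance_report.Rmd") {
  purrr::walk2(
    hospitals, output_files,
    function(hosp, out_file) {
      rmarkdown::render(
        input       = rmd,
        output_file = out_file,
        params      = list(hospital = hosp))
    })
}

# render_all_hospitals()   # uncomment to produce the four reports
```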

40.8 Templates

By using a template document that contains any desired formatting, you can adjust the aesthetics of how the Rmd output will look. You can create for instance an MS Word or Powerpoint file that contains pages/slides with the desired dimensions, watermarks, backgrounds, and fonts.

Word documents

To create a template, start a new Word document (or use an existing output with formatting that suits you), and edit fonts by defining the Styles. In Styles, Headings 1, 2, and 3 refer to the various markdown header levels (# Header 1, ## Header 2 and ### Header 3 respectively). Right click on the style and click ‘modify’ to change the font formatting as well as the paragraph (e.g. you can introduce page breaks before certain styles, which can help with spacing). Other aspects of the Word document such as margins, page size, headers etc. can be changed as in any Word document you are working directly within.

Powerpoint documents

As above, create a new slideset or use an existing powerpoint file with the desired formatting. For further editing, click on ‘View’ and ‘Slide Master’. From here you can change the ‘master’ slide appearance by editing the text formatting in the text boxes, as well as the background/page dimensions for the overall page.

Unfortunately, editing powerpoint files is slightly less flexible:

  • A first level header (# Header 1) will automatically become the title of a new slide.
  • A ## Header 2 text will not come up as a subtitle but as text within the slide’s main textbox (unless you find a way to manipulate the Master view).
  • Outputted plots and tables will automatically go into new slides. You will need to combine them, for instance with the patchwork package to combine ggplots, so that they show up on the same page. See this blog post about using the patchwork package to put multiple images on one slide.

See the officer package for a tool to work more in-depth with powerpoint presentations.

Integrating templates into the YAML

Once a template is prepared, the detail of this can be added in the YAML of the Rmd underneath the ‘output’ line and underneath where the document type is specified (which goes to a separate line itself). Note reference_doc can be used for powerpoint slide templates.

It is easiest to save the template in the same folder as where the Rmd file is (as in the example below), or in a subfolder within.

---
title: Surveillance report
output: 
 word_document:
  reference_docx: "template.docx"
params:
 date: 2021-04-10
 hospital: Central Hospital
---

Formatting HTML files

HTML files do not use templates, but can have the styles configured within the YAML. HTMLs are interactive documents, and are particularly flexible. We cover some basic options here.

  • Table of contents: We can add a table of contents with toc: true below, and also specify that it remains viewable (“floats”) as you scroll, with toc_float: true.

  • Themes: We can refer to some pre-made themes, which come from a Bootswatch theme library. In the below example we use cerulean. Other options include: journal, flatly, darkly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, and yeti.

  • Highlight: Configuring this changes the look of highlighted text (e.g. code within chunks that are shown). Supported styles include default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, breezedark, and textmate.

Here is an example of how to integrate the above options into the YAML.

---
title: "HTML example"
output:
  html_document:
    toc: true
    toc_float: true
    theme: cerulean
    highlight: kate
    
---

Below are two examples of HTML outputs which both have floating tables of contents, but different theme and highlight styles selected:

40.9 Dynamic content

In an HTML output, your report content can be dynamic. Below are some examples:

Tables

In an HTML report, you can print data frame / tibbles such that the content is dynamic, with filters and scroll bars. There are several packages that offer this capability.

To do this with the DT package, as is used throughout this handbook, you can insert a code chunk like this:

The function datatable() will print the provided data frame as a dynamic table for the reader. You can set rownames = FALSE to simplify the far left-side of the table. filter = "top" provides a filter over each column. To the options = argument, provide a list of other specifications. Below we include two: pageLength = 5 sets the number of rows that appear to 5 (the remaining rows can be viewed by paging through with the arrows), and scrollX = TRUE enables a scrollbar on the bottom of the table (for columns that extend too far to the right).

If your dataset is very large, consider only showing the top X rows by wrapping the data frame in head().
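A minimal sketch of such a chunk is shown here, using the arguments described above. The data frame demo_cases is made up for illustration and stands in for the linelist.

```r
# Hypothetical stand-in data; in a report this would be the linelist
demo_cases <- data.frame(
  case_id  = sprintf("case_%02d", 1:12),
  hospital = rep(c("Central Hospital", "Port Hospital", "Military Hospital"),
                 times = 4),
  age      = c(3, 25, 41, 67, 12, 30, 55, 8, 19, 72, 44, 36)
)

# Show only the first 10 rows, 5 per page, with column filters on top
dynamic_table <- DT::datatable(
  head(demo_cases, 10),
  rownames = FALSE,
  filter   = "top",
  options  = list(pageLength = 5, scrollX = TRUE))
```

In an R Markdown chunk, printing the object (or simply not assigning it) displays the table in the HTML output.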

HTML widgets

HTML widgets for R are a special class of R packages that enable increased interactivity by utilizing JavaScript libraries. You can embed them in HTML R Markdown outputs.

Some common examples of these widgets include:

  • Plotly (used in this handbook page and in the Interactive plots page)
  • visNetwork (used in the Transmission Chains page of this handbook)
  • Leaflet (used in the GIS Basics page of this handbook)
  • dygraphs (useful for interactively showing time series data)
  • DT (datatable()) (used to show dynamic tables with filter, sort, etc.)

The ggplotly() function from plotly is particularly easy to use. See the Interactive plots page.
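As a brief illustration, the sketch below builds a static ggplot and passes it to ggplotly(). The weekly counts are invented for the example and do not come from the handbook data.

```r
library(ggplot2)

# Hypothetical weekly counts, standing in for real surveillance data
weekly_cases <- data.frame(
  week  = as.Date("2021-03-01") + 7 * 0:5,
  cases = c(4, 9, 15, 12, 7, 3)
)

# An ordinary static plot
static_plot <- ggplot(weekly_cases, aes(x = week, y = cases)) +
  geom_col() +
  labs(x = "Week of onset", y = "Cases")

# Convert to an interactive HTML widget (hover, zoom, etc.)
interactive_plot <- plotly::ggplotly(static_plot)
```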

41 Organizing routine reports

This page covers the reportfactory package, which is an accompaniment to using R Markdown for reports.

In scenarios where you run reports routinely (daily, weekly, etc.), it eases the compilation of multiple R Markdown files and the organization of their outputs. In essence, it provides a “factory” from which you can run the R Markdown reports, get automatically date- and time-stamped folders for the outputs, and have “light” version control.

reportfactory is one of the packages developed by RECON (R Epidemics Consortium). Here is their website and Github.

41.1 Preparation

Load packages

From within RStudio, install the latest version of the reportfactory package from Github.

You can do this via the pacman package with p_load_current_gh(), which will force install of the latest version from Github. Provide the character string “reconverse/reportfactory”, which specifies the Github organization (reconverse) and repository (reportfactory). You can also use install_github() from the remotes package, as an alternative.

# Install and load the latest version of the package from Github
pacman::p_load_current_gh("reconverse/reportfactory")
#remotes::install_github("reconverse/reportfactory") # alternative

41.2 New factory

To create a new factory, run the function new_factory(). This will create a new self-contained R project folder. By default:

  • The factory will be added to your working directory
  • The factory R project will be called “new_factory.Rproj”
  • Your RStudio session will “move in” to this R project

# This will create the factory in the working directory
new_factory()

Looking inside the factory, you can see that sub-folders and some files were created automatically.

  • The report_sources folder will hold your R Markdown scripts, which generate your reports
  • The outputs folder will hold the report outputs (e.g. HTML, Word, PDF, etc.)
  • The scripts folder can be used to store other R scripts (e.g. that are sourced by your Rmd scripts)
  • The data folder can be used to hold your data (“raw” and “clean” subfolders are included)
  • A .here file, so you can use the here package to call files in sub-folders by their relation to this root folder (see R projects page for details)
  • A gitignore file, in case you link this R project to a Github repository (see the Version control and collaboration with Github page)
  • An empty README file, in case you use a Github repository

CAUTION: depending on your computer’s settings, files such as “.here” may exist but be invisible.

Of the default settings, below are several that you might want to adjust within the new_factory() command:

  • factory = Provide a name for the factory folder (default is “new_factory”)
  • path = Designate a file path for the new factory (default is the working directory)
  • report_sources = Provide an alternate name for the subfolder which holds the R Markdown scripts (default is “report_sources”)
  • outputs = Provide an alternate name for the folder which holds the report outputs (default is “outputs”)

See ?new_factory for a complete list of the arguments.
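For example, a factory with adjusted settings might be created as in the sketch below. The folder names are hypothetical, and tempdir() is used only so that the example does not leave a folder behind in your working directory; only the four arguments listed above are used.

```r
library(reportfactory)

# Create a factory with non-default names (all names here are illustrative)
new_factory(
  factory        = "sitrep_factory",   # name of the factory folder
  path           = tempdir(),          # where the factory folder is created
  report_sources = "rmd_reports",      # folder for the R Markdown scripts
  outputs        = "compiled")         # folder for the report outputs
```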

When you create the new factory, your R session is transferred to the new R project, so you should again load the reportfactory package.

pacman::p_load(reportfactory)

Now you can run the factory_overview() command to see the internal structure (all folders and files) in the factory.

factory_overview()            # print overview of the factory to console

The following “tree” of the factory’s folders and files is printed to the R console. Note that in the “data” folder there are sub-folders for “raw” and “clean” data, and example CSV data. There is also “example_report.Rmd” in the “report_sources” folder.

41.3 Create a report

From within the factory R project, create an R Markdown report just as you would normally, and save it into the “report_sources” folder. See the R Markdown page for instructions. For purposes of example, we have added the following to the factory:

  • A new R markdown script entitled “daily_sitrep.Rmd”, saved within the “report_sources” folder
  • Data for the report (“linelist_cleaned.rds”), saved to the “clean” sub-folder within the “data” folder

We can see using factory_overview() our R Markdown in the “report_sources” folder and the data file in the “clean” data folder (highlighted):

Below is a screenshot of the beginning of the R Markdown “daily_sitrep.Rmd”. You can see that the output format is set to be HTML, via the YAML header output: html_document.

In this simple script, there are commands to:

  • Load necessary packages
  • Import the linelist data using a filepath from the here package (read more in the page on Import and export)
linelist <- import(here("data", "clean", "linelist_cleaned.rds"))
  • Print a summary table of cases, and export it with export() as a .csv file
  • Print an epicurve, and export it with ggsave() as a .png file

You can review just the list of R Markdown reports in the “report_sources” folder with this command:

list_reports()

41.4 Compile

In a report factory, to “compile” an R Markdown report means that the .Rmd script will be run and the output will be produced (as specified in the script YAML e.g. as HTML, Word, PDF, etc).

The factory will automatically create a date- and time-stamped folder for the outputs in the “outputs” folder.

The report itself and any exported files produced by the script (e.g. csv, png, xlsx) will be saved into this folder. In addition, the Rmd script itself will be saved in this folder, so you have a record of that version of the script.

This contrasts with the normal behavior of a “knitted” R Markdown, which saves outputs to the location of the Rmd script. This default behavior can result in crowded, messy folders. The factory aims to improve organization when one needs to run reports frequently.

Compile by name

You can compile a specific report by running compile_reports() and providing the Rmd script name (without .Rmd extension) to reports =. For simplicity, you can skip the reports = and just write the R Markdown name in quotes, as below.
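A minimal sketch of such a call is given here, assuming it is run from within the factory R project and that “daily_sitrep.Rmd” is saved in the “report_sources” folder. The file check is not part of reportfactory; it is added only so that the sketch does nothing if run elsewhere.

```r
# Path to the report, relative to the factory root
sitrep_path <- file.path("report_sources", "daily_sitrep.Rmd")

# Compile by name (without the .Rmd extension), skipping the reports = label
if (file.exists(sitrep_path)) {
  reportfactory::compile_reports("daily_sitrep")
}
```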

This command would compile only the “daily_sitrep.Rmd” report, saving the HTML report, and the .csv table and .png epicurve exports into a date- and time-stamped sub-folder specific to the report, within the “outputs” folder.

Note that if you choose to provide the .Rmd extension, you must correctly type the extension as it is saved in the file name (.rmd vs. .Rmd).

Also note that when you compile, you may see several files temporarily appear in the “report_sources” folder - but they will soon disappear as they are transferred to the correct “outputs” folder.

Compile by number

You can also specify the Rmd script to compile by providing a number or vector of numbers to reports =. The numbers must align with the order the reports appear when you run list_reports().

# Compile the second and fourth Rmds in the "report_sources" folder
compile_reports(reports = c(2, 4))

Compile all

You can compile all the R Markdown reports in the “report_sources” folder by setting the reports = argument to TRUE.

Compile from sub-folder

You can add sub-folders to the “report_sources” folder. To run an R Markdown report from a subfolder, simply provide the name of the folder to subfolder =. Below is an example of code to compile an Rmd report that lives in a subfolder of “report_sources”.

compile_reports(
     reports = "summary_for_partners.Rmd",
     subfolder = "for_partners")

You can compile all Rmd reports within a subfolder by providing the subfolder name to reports =, with a slash on the end, as below.

compile_reports(reports = "for_partners/")

Parameterization

As noted in the page on Reports with R Markdown, you can run reports with specified parameters. You can pass these parameters as a list to compile_reports() via the params = argument. For example, in this fictional report there are three parameters provided to the R Markdown reports.

compile_reports(
  reports = "daily_sitrep.Rmd",
  params = list(most_recent_data = TRUE,
                region = "NORTHERN",
                rates_denominator = 10000),
  subfolder = "regional"
)

Using a “run-file”

If you have multiple reports to run, consider creating an R script that contains all the compile_reports() commands. A user can simply run all the commands in this R script and all the reports will compile. You can save this “run-file” to the “scripts” folder.
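One possible shape for such a run-file is sketched below. The file name, the second report name, and the helper function run_reports() are all hypothetical; the compile step is wrapped in a function so that nothing is compiled until it is called from within the factory R project.

```r
# Hypothetical run-file, e.g. saved as "scripts/run_reports.R"

# Reports to compile, by name ("weekly_summary" is an illustrative name)
reports_to_run <- c("daily_sitrep", "weekly_summary")

# Compile each report in turn, printing progress to the console
run_reports <- function(report_names = reports_to_run) {
  for (report in report_names) {
    message("Compiling: ", report)
    reportfactory::compile_reports(reports = report)
  }
  invisible(report_names)
}

# run_reports()   # uncomment to compile every report listed above
```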

41.5 Outputs

After we have compiled the reports a few times, the “outputs” folder might look like this (highlights added for clarity):

  • Within “outputs”, sub-folders have been created for each Rmd report
  • Within those, further sub-folders have been created for each unique compiling
    • These are date- and time-stamped (“2021-04-23_T11-07-36” means 23rd April 2021 at 11:07:36)
    • You can edit the date/time-stamp format. See ?compile_reports
  • Within each date/time compiled folder, the report output is stored (e.g. HTML, PDF, Word) along with the Rmd script (version control!) and any other exported files (e.g. table.csv, epidemic_curve.png)

Here is a view inside one of the date/time-stamped folders, for the “daily_sitrep” report. The file path is highlighted in yellow for emphasis.

Finally, below is a screenshot of the HTML report output.

You can use list_outputs() to review a list of the outputs.

41.6 Miscellaneous

Knit

You can still “knit” one of your R Markdown reports by pressing the “Knit” button, if you want. If you do this, the outputs will appear, as by default, in the folder where the Rmd is saved - the “report_sources” folder. In prior versions of reportfactory, having any non-Rmd files in “report_sources” would prevent compiling, but this is no longer the case. You can run compile_reports() and no error will occur.

Scripts

We encourage you to utilize the “scripts” folder to store “runfiles” or .R scripts that are sourced by your .Rmd scripts. See the page on R Markdown for tips on how to structure your code across several files.

Extras

  • With reportfactory, you can use the function list_deps() to list all packages required across all the reports in the entire factory.

  • There is an accompanying package in development called rfextras that offers more helper functions to assist you in building reports, such as:

    • load_scripts() - sources/loads all .R scripts in a given folder (the “scripts” folder by default)
    • find_latest() - finds the latest version of a file (e.g. the latest dataset)

41.7 Resources

See the reportfactory package’s Github page

See the rfextras package’s Github page

42 Dashboards with R Markdown

This page will cover the basic use of the flexdashboard package. This package allows you to easily format R Markdown output as a dashboard with panels and pages. The dashboard content can be text, static figures/tables or interactive graphics.

Advantages of flexdashboard:

  • It requires minimal non-standard R coding - with very little practice you can quickly create a dashboard
  • The dashboard can usually be emailed to colleagues as a self-contained HTML file - no server required
  • You can combine flexdashboard with shiny, ggplotly, and other “html widgets” to add interactivity

Disadvantages of flexdashboard:

  • Less customization as compared to using shiny alone to create a dashboard

Very comprehensive tutorials on using flexdashboard that informed this page can be found in the Resources section. Below we describe the core features and give an example of building a dashboard to explore an outbreak, using the case linelist data.

42.1 Preparation

Load packages

In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

pacman::p_load(
  rio,             # data import/export     
  here,            # locate files
  tidyverse,       # data management and visualization
  flexdashboard,   # dashboard versions of R Markdown reports
  shiny,           # interactive figures
  plotly           # interactive figures
)

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import data with the import() function from the rio package (it handles many file types like .xlsx, .csv, .rds - see the Import and export page for details).

# import the linelist
linelist <- import("linelist_cleaned.rds")

The first 50 rows of the linelist are displayed below.

42.2 Create new R Markdown

After you have installed the package, create a new R Markdown file by clicking through to File > New file > R Markdown.

In the window that opens, select “From Template” and select the “Flex Dashboard” template. You will then be prompted to name the document. In this page’s example, we will name our R Markdown as “outbreak_dashboard.Rmd”.

42.3 The script

The script is an R Markdown script, and so has the same components and organization as described in the page on Reports with R Markdown. We briefly re-visit these and highlight differences from other R Markdown output formats.

YAML

At the top of the script is the “YAML” header. This must begin with three dashes --- and must close with three dashes ---. YAML parameters come in key:value pairs. The indentation and placement of colons in YAML is important - the key:value pairs are separated by colons (not equals signs!).

The YAML should begin with metadata for the document. The order of these primary YAML parameters (not indented) does not matter. For example:

title: "My document"
author: "Me"
date: "`r Sys.Date()`"

You can use R code in YAML values by writing it as in-line code (preceded by r within backticks) and also placing it within quotes (see above for date).

A required YAML parameter is output:, which specifies the type of file to be produced (e.g. html_document, pdf_document, word_document, or powerpoint_presentation). For flexdashboard this parameter value is a bit confusing - it must be set as output:flexdashboard::flex_dashboard. Note the single and double colons, and the underscore. This YAML output parameter is often followed by an additional colon and indented sub-parameters (see orientation: and vertical_layout: parameters below).

title: "My dashboard"
author: "Me"
date: "`r Sys.Date()`"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: scroll

As shown above, indentations (2 spaces) are used for sub-parameters. In this case, do not forget to put an additional colon after the primary, like key:value:.

If appropriate, logical values should be given in YAML in lowercase (true, false, null). If a colon is part of your value (e.g. in the title) put the value in quotes. See the examples in sections below.

Code chunks

An R Markdown script can contain multiple code “chunks” - these are areas of the script where you can write multiple-line R code and they function just like mini R scripts.

Code chunks are created with three backticks and curly brackets with a lowercase “r” within. The chunk is closed with three backticks. You can create a new chunk by typing it out yourself, by using the keyboard shortcut “Ctrl + Alt + i” (or Cmd + Option + i on Mac), or by clicking the green ‘insert a new code chunk’ icon at the top of your script editor. Many examples are given below.

Narrative text

Outside of an R code “chunk”, you can write narrative text. As described in the page on Reports with R Markdown, you can italicize text by surrounding it with one asterisk (*), or bold by surrounding it with two asterisks (**). Recall that bullets and numbering schemes are sensitive to newlines, indentation, and finishing a line with two spaces.

You can also insert in-line R code into text as described in the Reports with R Markdown page, by surrounding the code with backticks and starting the command with “r”: `r 1+1` (see example with date above).

Headings

Different heading levels are established with different numbers of hash symbols, as described in the Reports with R Markdown page.

In flexdashboard, a primary heading (#) creates a “page” of the dashboard. Second-level headings (##) create a column or a row depending on your orientation: parameter (see details below). Third-level headings (###) create panels for plots, charts, tables, text, etc.

# First-level heading (page)

## Second level heading (row or column)  

### Third-level heading (pane for plot, chart, etc.)

42.4 Section attributes

As in a normal R markdown, you can specify attributes to apply to parts of your dashboard by including key=value options after a heading, within curly brackets { }. For example, in a typical HTML R Markdown report you might organize sub-headings into tabs with ## My heading {.tabset}.

Note that these attributes are written after a heading in a text portion of the script. These are different from the knitr options inserted at the top of R code chunks, such as out.height =.

Section attributes specific to flexdashboard include:

  • {data-orientation=} Set to either rows or columns. If your dashboard has multiple pages, add this attribute to each page to indicate orientation (further explained in layout section).
  • {data-width=} and {data-height=} set relative size of charts, columns, rows laid out in the same dimension (horizontal or vertical). Absolute sizes are adjusted to best fill the space on any display device thanks to the flexbox engine.
    • Height of charts also depends on whether you set the YAML parameter vertical_layout: fill or vertical_layout: scroll. If set to scroll, figure height will reflect the traditional fig.height = option in the R code chunk.
    • See complete size documentation at the flexdashboard website
  • {.hidden} Use this to exclude a specific page from the navigation bar
  • {data-navmenu=} Use this in a page-level heading to nest it within a navigation bar drop-down menu. Provide the name (in quotes) of the drop-down menu. See example below.

42.5 Layout

Adjust the layout of your dashboard in the following ways:

  • Add pages, columns/rows, and charts with R Markdown headings (e.g. #, ##, or ###)
  • Adjust the YAML parameter orientation: to either rows or columns
  • Specify whether the layout fills the browser or allows scrolling
  • Add tabs to a particular section heading

Pages

First-level headings (#) in the R Markdown will represent “pages” of the dashboard. By default, pages will appear in a navigation bar along the top of the dashboard.

You can group pages into a “menu” within the top navigation bar by adding the attribute {data-navmenu=} to the page heading. Be careful - do not include spaces around the equals sign otherwise it will not work!

Here is what the script produces:

You can also convert a page or a column into a “sidebar” on the left side of the dashboard by adding the {.sidebar} attribute. It can hold text (viewable from any page), or if you have integrated shiny interactivity it can be useful to hold user-input controls such as sliders or drop-down menus.

Here is what the script produces:

Orientation

Set the orientation: yaml parameter to indicate how your second-level (##) R Markdown headings should be interpreted - as either orientation: columns or orientation: rows.

Second-level headings (##) will be interpreted as new columns or rows based on this orientation setting.

If you set orientation: columns, second-level headers will create new columns in the dashboard. The below dashboard has one page, containing two columns, with a total of three panels. You can adjust the relative width of the columns with {data-width=} as shown below.

Here is what the script produces:

If you set orientation: rows, second-level headers will create new rows instead of columns. Below is the same script as above, but orientation: rows so that second-level headings produce rows instead of columns. You can adjust the relative height of the rows with {data-height=} as shown below.

Here is what the script produces:

If your dashboard has multiple pages, you can designate the orientation for each specific page by adding the {data-orientation=} attribute to the header of each page (specify either rows or columns without quotes).

Tabs

You can divide content into tabs with the {.tabset} attribute, as in other HTML R Markdown outputs.

Simply add this attribute after the desired heading. Sub-headings under that heading will be displayed as tabs. For example, in the example script below column 2 on the right (##) is modified so that the epidemic curve and table panes (###) are displayed in tabs.

You can do the same with rows if your orientation is rows.

Here is what the script produces:
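The example scripts themselves are not reproduced here. As a rough sketch of the kind of layout described in this section, the R code below writes a minimal flexdashboard skeleton to a temporary file: one page, two columns (the first with a relative width, the second shown as tabs), and three panels. The title, panel names, and file name are illustrative assumptions, and the panels are left empty.

```r
# Lines of a minimal, hypothetical flexdashboard script
skeleton <- c(
  "---",
  'title: "Outbreak dashboard"',
  "output:",
  "  flexdashboard::flex_dashboard:",
  "    orientation: columns",
  "    vertical_layout: fill",
  "---",
  "",
  "## Column 1 {data-width=500}",
  "",
  "### Summary and table",
  "",
  "## Column 2 {.tabset}",
  "",
  "### Epidemic curve",
  "",
  "### Transmission chain"
)

# Save the skeleton as an Rmd file in a temporary folder
rmd_path <- file.path(tempdir(), "dashboard_skeleton.Rmd")
writeLines(skeleton, rmd_path)
```

With flexdashboard installed, passing rmd_path to rmarkdown::render() should produce an HTML dashboard with empty panels, which can then be filled with code chunks.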

42.6 Adding content

Let’s begin to build a dashboard. Our simple dashboard will have 1 page, 2 columns, and 4 panels. We will build the panels piece-by-piece for demonstration.

You can easily include standard R outputs such as text, ggplots, and tables (see Tables for presentation page). Simply code them within an R code chunk as you would for any other R Markdown script.

Note: you can download the finished Rmd script and HTML dashboard output - see the Download handbook and data page.

Text

You can type in Markdown text and include in-line code as for any other R Markdown output. See the Reports with R Markdown page for details.

In this dashboard we include a summary text panel that includes dynamic text showing the latest hospitalisation date and number of cases reported in the outbreak.
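The values for such dynamic text can be calculated in a code chunk and then referenced in the narrative. The sketch below uses a small made-up data frame; the column name date_hospitalisation is assumed to match the linelist.

```r
# Hypothetical stand-in for the linelist
demo_linelist <- data.frame(
  case_id              = c("a1", "b2", "c3", "d4"),
  date_hospitalisation = as.Date(c("2021-04-02", NA, "2021-04-09", "2021-04-05"))
)

# Values to be shown in the text panel
n_cases     <- nrow(demo_linelist)
latest_hosp <- max(demo_linelist$date_hospitalisation, na.rm = TRUE)

# The same sentence assembled as a single character string
summary_text <- paste0(
  n_cases, " cases reported; latest hospitalisation on ",
  format(latest_hosp, "%Y-%m-%d"), ".")
```

In the narrative text of the Rmd, these objects would be inserted with in-line code, for example `r n_cases`.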

Tables

You can include R code chunks that print outputs such as tables. But the output will look best and respond to the window size if you use the kable() function from knitr to display your tables. The flextable functions may produce tables that are shortened / cut-off.

For example, below we feed the linelist through a count() command to produce a summary table of cases by hospital. Ultimately, the table is piped to knitr::kable() and the result has a scroll bar on the right. You can read more about customizing your table with kable() and kableExtra here.
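A minimal sketch of this pipeline, with a made-up data frame in place of the linelist, could look like this:

```r
library(dplyr)

# Hypothetical stand-in for the linelist
demo_linelist <- data.frame(
  hospital = c("Central Hospital", "Port Hospital", "Central Hospital",
               "Military Hospital", "Port Hospital", "Central Hospital")
)

# Count cases per hospital and pass the result to kable()
hospital_table <- demo_linelist %>%
  count(hospital, name = "cases") %>%
  knitr::kable()
```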

Here is what the script produces:

If you want to show a dynamic table that allows the user to filter, sort, and/or click through “pages” of the data frame, use the package DT and its function datatable(), as in the code below.

In the example code below, the data frame linelist is printed. You can set rownames = FALSE to conserve horizontal space, and filter = "top" to have filters on top of every column. A list of other specifications can be provided to options =. Below, we set pageLength = so that 5 rows appear and scrollX = so the user can use a scroll bar on the bottom to scroll horizontally. The argument class = 'white-space: nowrap' ensures that each row is only one line (not multiple lines). You can read about other possible arguments and values here or by entering ?datatable

DT::datatable(linelist, 
              rownames = FALSE, 
              options = list(pageLength = 5, scrollX = TRUE), 
              class = 'white-space: nowrap' )

Plots

You can print plots to a dashboard pane as you would in an R script. In our example, we use the incidence2 package to create an “epicurve” by age group with two simple commands (see Epidemic curves page). However, you could use ggplot() and print a plot in the same manner.

Here is what the script produces:

Interactive plots

You can also pass a standard ggplot or other plot object to ggplotly() from the plotly package (see the Interactive plots page). This will make your plot interactive, allow the reader to “zoom in”, and show-on-hover the value of every data point (in this scenario the number of cases per week and age group in the curve).

age_outbreak <- incidence(linelist, date_onset, "week", groups = age_cat)
plot(age_outbreak, fill = age_cat, col_pal = muted, title = "") %>% 
  plotly::ggplotly()

Here is what this looks like in the dashboard (gif). This interactive functionality will still work even if you email the dashboard as a static file (not online on a server).

HTML widgets

HTML widgets for R are a special class of R packages that increase interactivity by utilizing JavaScript libraries. You can embed them in R Markdown outputs (such as a flexdashboard) and in Shiny dashboards.

Some common examples of these widgets include:

  • Plotly (used in this handbook page and in the Interactive plots page)
  • visNetwork (used in the Transmission Chains page of this handbook)
  • Leaflet (used in the GIS Basics page of this handbook)
  • dygraphs (useful for interactively showing time series data)
  • DT (datatable()) (used to show dynamic tables with filter, sort, etc.)

Below we demonstrate adding an epidemic transmission chain which uses visNetwork to the dashboard. The script shows only the new code added to the “Column 2” section of the R Markdown script. You can find the code in the Transmission chains page of this handbook.

Here is what the script produces:

42.7 Code organization

You may elect to have all code within the R Markdown flexdashboard script. Alternatively, to have a cleaner and more concise dashboard script you may choose to call upon code/figures that are hosted or created in external R scripts. This is described in greater detail in the Reports with R Markdown page.

42.8 Shiny

Integrating the R package shiny can make your dashboards even more reactive to user input. For example, you could have the user select a jurisdiction, or a date range, and have panels react to their choice (e.g. filter the data displayed). To embed shiny reactivity into flexdashboard, you need only make a few changes to your flexdashboard R Markdown script.

You can use shiny to produce apps/dashboards without flexdashboard too. The handbook page on Dashboards with Shiny gives an overview of this approach, including primers on shiny syntax, app file structure, and options for sharing/publishing (including free server options). These syntax and general tips translate into the flexdashboard context as well.

Embedding shiny in flexdashboard is, however, a fundamental change to your flexdashboard. It will no longer produce an HTML output that you can send by email and that anyone can open and view. Instead, it will be an “app”. The “Knit” button at the top of the script will be replaced by a “Run document” icon, which will open an instance of the interactive dashboard locally on your computer.

Sharing your dashboard will now require that you either:

  • Send the Rmd script to the viewer, who opens it in R on their computer and runs the app, or
  • Host the app/dashboard on a server accessible to the viewer

Thus, there are benefits to integrating shiny, but also complications. If easy sharing by email is a priority and you don’t need shiny reactive capabilities, consider the reduced interactivity offered by ggplotly() as demonstrated above.

Below we give a very simple example using the same “outbreak_dashboard.Rmd” as above. Extensive documentation on integrating Shiny into flexdashboard is available online here.

Settings

Enable shiny in a flexdashboard by adding the YAML parameter runtime: shiny at the same indentation level as output:, as below:

---
title: "Outbreak dashboard (Shiny demo)"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
runtime: shiny
---

It is also convenient to enable a “side bar” to hold the shiny input widgets that will collect information from the user. As explained above, create a column and indicate the {.sidebar} option to create a side bar on the left side. You can add text and R chunks containing the shiny input commands within this column.

If your app/dashboard is hosted on a server and may have multiple simultaneous users, name the first R code chunk as global. Include the commands to import/load your data in this chunk. This special named chunk is treated differently, and the data imported within it are only imported once (not continuously) and are available for all users. This improves the start-up speed of the app.

Worked example

Here we adapt the flexdashboard script “outbreak_dashboard.Rmd” to include shiny. We will add the capability for the user to select a hospital from a drop-down menu, and have the epidemic curve reflect only cases from that hospital, with a dynamic plot title. We do the following:

  • Add runtime: shiny to the YAML
  • Re-name the setup chunk as global
  • Create a sidebar containing:
    • Code to create a vector of unique hospital names
    • A selectInput() command (shiny drop-down menu) with the choice of hospital names. The selection is saved as hospital_choice, which can be referenced in later code as input$hospital_choice
  • The epidemic curve code (column 2) is wrapped within renderPlot({ }), including:
    • A filter on the dataset restricting the column hospital to the current value of input$hospital_choice
    • A dynamic plot title that incorporates input$hospital_choice

Note that any code referencing an input$ value must be within a render*({ }) function, such as renderPlot({ }), in order to be reactive.
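
The R code inside those chunks might look roughly like the sketch below. It is not the handbook's actual script: the data frame is a made-up stand-in for the linelist, and ggplot2 is used in place of incidence2 to keep the example short. In the Rmd, the first part would sit in the global chunk, the second in the sidebar column, and the third in Column 2.

```r
library(shiny)
library(ggplot2)

# --- global chunk: runs once, shared by all users ---
# Hypothetical stand-in for the handbook linelist
linelist <- data.frame(
  date_onset = as.Date("2014-05-01") + c(0, 3, 7, 8, 14, 15, 21, 22, 28, 30),
  hospital   = rep(c("Central Hospital", "Port Hospital"), times = 5)
)

# --- sidebar column: input widget ---
hospital_names <- sort(unique(linelist$hospital))

hospital_selector <- selectInput(
  inputId  = "hospital_choice",
  label    = "Hospital:",
  choices  = hospital_names,
  selected = hospital_names[1]
)

hospital_selector   # printing the widget places it in the sidebar

# --- Column 2: reactive epicurve ---
# Code that reads input$hospital_choice lives inside renderPlot({ })
epicurve_render <- renderPlot({
  selected_cases <- linelist[linelist$hospital == input$hospital_choice, ]

  ggplot(selected_cases, aes(x = date_onset)) +
    geom_histogram(binwidth = 7, fill = "darkred") +
    labs(
      title = paste("Cases by week of onset:", input$hospital_choice),
      x = "Week of onset",
      y = "Cases"
    )
})

epicurve_render   # printing the render function displays the plot in the pane
```

In a real chunk you could also write renderPlot({ }) directly without assigning it to a name; the assignment here only makes the pieces easier to refer to.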

Here is the top of the script, including YAML, global chunk, and sidebar:

Here is Column 2, with the reactive epicurve plot:

And here is the dashboard:

Other examples

To read a health-related example of a Shiny-flexdashboard using the shiny interactivity and the leaflet mapping widget, see this chapter of the online book Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny.

42.9 Sharing

Dashboards that do not contain Shiny elements will output an HTML file (.html), which can be emailed (if size permits). This is useful, as you can send the “dashboard” report and not have to set up a server to host it as a website.

If you have embedded shiny, you will not be able to send an output by email, but you can send the script itself to an R user, or host the dashboard on a server as explained above.

42.10 Resources

Excellent tutorials that informed this page can be found below. If you review these, most likely within an hour you can have your own dashboard.

https://bookdown.org/yihui/rmarkdown/dashboards.html

https://rmarkdown.rstudio.com/flexdashboard/

https://rmarkdown.rstudio.com/flexdashboard/using.html

https://rmarkdown.rstudio.com/flexdashboard/examples.html

43 Dashboards with Shiny

Dashboards are often a great way to share results from analyses with others. Producing a dashboard with shiny requires a relatively advanced knowledge of the R language, but offers incredible customization and possibilities.

It is recommended that someone learning dashboards with shiny has good knowledge of data transformation and visualisation, and is comfortable debugging code, and writing functions. Working with dashboards is not intuitive when you’re starting, and is difficult to understand at times, but is a great skill to learn and gets much easier with practice!

This page will give a short overview of how to make dashboards with shiny and its extensions. For an alternative method of making dashboards that is faster, easier, but perhaps less customizable, see the page on flexdashboard (Dashboards with R Markdown).

43.1 Preparation

Load packages

In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

We begin by installing the shiny R package:

pacman::p_load("shiny")

Import data

If you would like to follow-along with this page, see this section of the Download handbook and data. There are links to download the R scripts and data files that produce the final Shiny app.

If you try to re-construct the app using these files, please be aware of the R project folder structure that is created over the course of the demonstration (e.g. folders for “data” and for “funcs”).

43.2 The structure of a shiny app

Basic file structures

To understand shiny, we first need to understand how the file structure of an app works! We should make a brand new directory before we start. This can be made easier by choosing New project in RStudio, and choosing Shiny Web Application. This will create the basic structure of a shiny app for you.

When opening this project, you’ll notice there is a .R file already present called app.R. It is essential that we have one of two basic file structures:

  1. One file called app.R, or
  2. Two files, one called ui.R and the other server.R

In this page, we will use the first approach of having one file called app.R. Here is an example script:

# an example of app.R

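library(shiny)

# plot_func() is a placeholder for your own plotting function;
# this stand-in draws a histogram of `param` random values
plot_func <- function(param) {
    ggplot2::ggplot(data.frame(x = rnorm(param)), ggplot2::aes(x = x)) +
        ggplot2::geom_histogram(bins = 20)
}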

ui <- fluidPage(

    # Application title
    titlePanel("My app"),

    # Sidebar with a slider input widget
    sidebarLayout(
        sidebarPanel(
            sliderInput(
                inputId = "input_1",
                label = "Number of observations",
                min = 10, max = 500, value = 100
            )
        ),

        # Show a plot 
        mainPanel(
           plotOutput("my_plot")
        )
    )
)

# Define server logic required to draw a histogram
server <- function(input, output) {
     
     plot_1 <- reactive({
          plot_func(param = input$input_1)
     })
     
    output$my_plot <- renderPlot({
       plot_1()
    })
}


# Run the application 
shinyApp(ui = ui, server = server)

If you open this file, you’ll notice that two objects are defined - one called ui and another called server. These objects must be defined in every shiny app and are central to the structure of the app itself! In fact, the only difference between the two file structures described above is that in structure 1, both ui and server are defined in one file, whereas in structure 2 they are defined in separate files. Note: we can also (and we should if we have a larger app) have other .R files in our structure that we can source() into our app.
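
As a rough sketch of structure 2, the same kind of app could be split as shown below. The two files are shown together in one block for brevity, and the widget, output names, and placeholder histogram are illustrative choices rather than part of the handbook's app.

```r
# ---- ui.R ----
library(shiny)

ui <- fluidPage(
  titlePanel("My app"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("n_obs", label = "Number of observations",
                  min = 10, max = 200, value = 50)
    ),
    mainPanel(plotOutput("my_plot"))
  )
)

# ---- server.R ----
server <- function(input, output, session) {
  # Redrawn whenever the slider value changes
  output$my_plot <- renderPlot({
    hist(rnorm(input$n_obs), main = "Random values", xlab = "Value")
  })
}
```

With two files there is no shinyApp() call at the end; shiny finds ui.R and server.R in the app directory when the app is run.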

The server and the ui

We next need to understand what the server and ui objects actually do. Put simply, these are two objects that are interacting with each other whenever the user interacts with the shiny app.

The UI element of a shiny app is, on a basic level, R code that creates an HTML interface. This means everything that is displayed in the UI of an app. This generally includes:

  • “Widgets” - dropdown menus, check boxes, sliders, etc that can be interacted with by the user
  • Plots, tables, etc - outputs that are generated with R code
  • Navigation aspects of an app - tabs, panes, etc.
  • Generic text, hyperlinks, etc
  • HTML and CSS elements (addressed later)

The most important thing to understand about the UI is that it receives inputs from the user and displays outputs from the server. There is no active code running in the ui at any time - all changes seen in the UI are passed through the server (more or less). So we have to make our plots, downloads, etc in the server.

The server of the shiny app is where all code is being run once the app starts up. The way this works is a little confusing. The server function will effectively react to the user interfacing with the UI, and run chunks of code in response. If things change in the server, these will be passed back up to the ui, where the changes can be seen. Importantly, the code in the server will be executed non-consecutively (or it’s best to think of it this way). Basically, whenever a ui input affects a chunk of code in the server, it will run automatically, and that output will be produced and displayed.

This all probably sounds very abstract for now, so we’ll have to dive into some examples to get a clear idea of how this actually works.

Before you start to build an app

Before you begin to build an app, it’s immensely helpful to know what you want to build. Since your UI will be written in code, you can’t really visualise what you’re building unless you are aiming for something specific. For this reason, it is immensely helpful to look at lots of examples of shiny apps to get an idea of what you can make - even better if you can look at the source code behind these apps! Some great resources for this are:

Once you get an idea for what is possible, it’s also helpful to map out what you want yours to look like - you can do this on paper or in any drawing software (PowerPoint, MS Paint, etc.). It’s helpful to start simple for your first app! There’s also no shame in using the code of a nice app you find online as a template for your work - it’s much easier than building something from scratch!

43.3 Building a UI

When building our app, it’s easier to work on the UI first so we can see what we’re making, and not risk the app failing because of any server errors. As mentioned previously, it’s often good to use a template when working on the UI. There are a number of standard layouts that can be used with shiny that are available from the base shiny package, but it’s worth noting that there are also a number of package extensions such as shinydashboard. We’ll use an example from base shiny to start with.

A shiny UI is generally defined as a series of nested functions, in the following order:

  1. A function defining the general layout (the most basic is fluidPage(), but more are available)
  2. Panels within the layout such as:
    • a sidebar (sidebarPanel())
    • a “main” panel (mainPanel())
    • a tab (tabPanel())
    • a generic “column” (column())
  3. Widgets and outputs - these can confer inputs to the server (widgets) or outputs from the server (outputs)
    • Widgets generally are styled as xxxInput() e.g. selectInput()
    • Outputs are generally styled as xxxOutput() e.g. plotOutput()

It’s worth stating again that these can’t be visualised easily in an abstract way, so it’s best to look at an example! Let’s consider making a basic app that visualises our malaria facility count data by district. This data has a lot of different parameters, so it would be great if the end user could apply some filters to see the data by age group/district as they see fit! We can use a very simple shiny layout to start - the sidebar layout. This is a layout where widgets are placed in a sidebar on the left, and the plot is placed on the right.

Let’s plan our app - we can start with a selector that lets us choose the district where we want to visualise data, and another to let us choose the age group we are interested in. We’ll aim to use these filters to show an epicurve that reflects these parameters. So for this we need:

  1. Two dropdown menus that let us choose the district we want, and the age group we’re interested in.
  2. An area where we can show our resulting epicurve.

This might look something like this:

library(shiny)

ui <- fluidPage(

  titlePanel("Malaria facility visualisation app"),

  sidebarLayout(

    sidebarPanel(
         # selector for district
         selectInput(
              inputId = "select_district",
              label = "Select district",
              choices = c(
                   "All",
                   "Spring",
                   "Bolo",
                   "Dingo",
                   "Barnard"
              ),
              selected = "All",
              multiple = TRUE
         ),
         # selector for age group
         selectInput(
              inputId = "select_agegroup",
              label = "Select age group",
              choices = c(
                   "All ages" = "malaria_tot",
                   "0-4 yrs" = "malaria_rdt_0-4",
                   "5-14 yrs" = "malaria_rdt_5-14",
                   "15+ yrs" = "malaria_rdt_15"
              ), 
              selected = "malaria_tot",
              multiple = FALSE
         )

    ),

    mainPanel(
      # epicurve goes here
      plotOutput("malaria_epicurve")
    )
    
  )
)

When app.R is run with the above UI code (with no active code in the server portion of app.R) the layout appears looking like this - note that there will be no plot if there is no server to render it, but our inputs are working!

This is a good opportunity to discuss how widgets work - note that each widget is accepting an inputId, a label, and a series of other options that are specific to the widget type. This inputId is extremely important - these are the IDs that are used to pass information from the UI to the server. For this reason, they must be unique. You should make an effort to name them something sensible, and specific to what they are interacting with in cases of larger apps.

You should read the documentation carefully for full details on what each of these widgets does. Widgets will pass specific types of data to the server depending on the widget type, and this needs to be fully understood. For example, selectInput() will pass a character type to the server:

  • If we select Spring for the first widget here, it will pass the character object "Spring" to the server.
  • If we select two items from the dropdown menu, they will come through as a character vector (e.g. c("Spring", "Bolo")).

Other widgets will pass different types of object to the server! For example:

  • numericInput() will pass a numeric type object to the server
  • checkboxInput() will pass a logical type object to the server (TRUE or FALSE)

It’s also worth noting the named vector we used for the age data here. For many widgets, using a named vector as the choices will display the names of the vector as the display choices, but pass the selected value from the vector to the server. I.e. here someone can select “15+” from the drop-down menu, and the UI will pass "malaria_rdt_15" to the server - which happens to be the name of the column we’re interested in!
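
To see this mapping without launching an app, you can inspect the HTML that selectInput() generates. This is only an illustrative check, and the object names are arbitrary.

```r
library(shiny)

age_selector <- selectInput(
  inputId = "select_agegroup",
  label = "Select age group",
  choices = c(
    "All ages" = "malaria_tot",
    "0-4 yrs" = "malaria_rdt_0-4",
    "5-14 yrs" = "malaria_rdt_5-14",
    "15+ yrs" = "malaria_rdt_15"
  ),
  selected = "malaria_tot"
)

# Convert the widget to its HTML text
age_selector_html <- as.character(age_selector)

# The name of each element is the label shown to the user ...
label_is_shown <- grepl("15+ yrs", age_selector_html, fixed = TRUE)

# ... while the element's value is what gets passed to the server
value_is_passed <- grepl('value="malaria_rdt_15"', age_selector_html, fixed = TRUE)

label_is_shown
value_is_passed
```

Both checks should come back TRUE: the label appears as the visible option text, and the column name is stored as the option's value.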

There are loads of widgets that you can use to do lots of things with your app. Widgets also allow you to upload files into your app, and download outputs. There are also some excellent shiny extensions that give you access to more widgets than base shiny - the shinyWidgets package is a great example of this. To look at some examples you can look at the following links:

43.4 Loading data into our app

The next step in our app development is getting the server up and running. To do this however, we need to get some data into our app, and figure out all the calculations we’re going to do. A shiny app is not straightforward to debug, as it’s often not clear where errors are coming from, so it’s ideal to get all our data processing and visualisation code working before we start making the server itself.

So given we want to make an app that shows epi curves that change based on user input, we should think about what code we would need to run this in a normal R script. We’ll need to:

  1. Load our packages
  2. Load our data
  3. Transform our data
  4. Develop a function to visualise our data based on user inputs

This list is pretty straightforward, and shouldn’t be too hard to do. It’s now important to think about which parts of this process need to be done only once and which parts need to run in response to user inputs. This is because shiny apps generally run some startup code once, before the app launches. It will help our app’s performance if as much of our code as possible is moved to this startup section. For this example, we only need to load our data/packages and do basic transformations once, so we can put that code outside the server. This means the only thing we’ll need in the server is the code to visualise our data. Since we’re visualising our data with a function, we can also define that function outside the server, so that it is in the environment when the app runs! Let’s develop all of these components in a script first.
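
The split between code that runs once and code that runs in response to inputs might be sketched as follows. The data frame and names are placeholders chosen for illustration only.

```r
library(shiny)

# ---- Startup code: runs once when the app launches ----
# (load packages, import data, do one-off transformations, define functions)
startup_data <- data.frame(
  district = c("Spring", "Bolo", "Dingo", "Bolo"),
  cases    = c(11, 26, 18, 49)
)

ui <- fluidPage(
  selectInput("district", "District", choices = unique(startup_data$district)),
  textOutput("total_cases")
)

server <- function(input, output, session) {
  # ---- Reactive code: re-runs whenever input$district changes ----
  output$total_cases <- renderText({
    selected <- startup_data[startup_data$district == input$district, ]
    paste("Total cases:", sum(selected$cases))
  })
}

# shinyApp(ui = ui, server = server)   # uncomment to launch the app
```

Only the small filtering and summing step sits inside the server; everything above it is evaluated a single time.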

First let’s load our data. Since we’re working with a new project, and we want to keep it clean, we can create a new directory called data, and add our malaria data in there. We can run the code below in a testing script that we will eventually delete when we clean up the structure of our app.

pacman::p_load("tidyverse", "lubridate")

# read data
malaria_data <- rio::import(here::here("data", "malaria_facility_count_data.rds")) %>% 
  as_tibble()

print(malaria_data)
## # A tibble: 3,038 x 10
##    location_name data_date  submitted_date Province District `malaria_rdt_0-4` `malaria_rdt_5-14` malaria_rdt_15 malaria_tot newid
##    <chr>         <date>     <date>         <chr>    <chr>                <int>              <int>          <int>       <int> <int>
##  1 Facility 1    2020-08-11 2020-08-12     North    Spring                  11                 12             23          46     1
##  2 Facility 2    2020-08-11 2020-08-12     North    Bolo                    11                 10              5          26     2
##  3 Facility 3    2020-08-11 2020-08-12     North    Dingo                    8                  5              5          18     3
##  4 Facility 4    2020-08-11 2020-08-12     North    Bolo                    16                 16             17          49     4
##  5 Facility 5    2020-08-11 2020-08-12     North    Bolo                     9                  2              6          17     5
##  6 Facility 6    2020-08-11 2020-08-12     North    Dingo                    3                  1              4           8     6
##  7 Facility 6    2020-08-10 2020-08-12     North    Dingo                    4                  0              3           7     6
##  8 Facility 5    2020-08-10 2020-08-12     North    Bolo                    15                 14             13          42     5
##  9 Facility 5    2020-08-09 2020-08-12     North    Bolo                    11                 11             13          35     5
## 10 Facility 5    2020-08-08 2020-08-12     North    Bolo                    19                 15             15          49     5
## # ... with 3,028 more rows

It will be easier to work with this data if we use tidy data standards, so we should also transform into a longer data format, where age group is a column, and cases is another column. We can do this easily using what we’ve learned in the Pivoting data page.

malaria_data <- malaria_data %>%
  select(-newid) %>%
  pivot_longer(cols = starts_with("malaria_"), names_to = "age_group", values_to = "cases_reported")

print(malaria_data)
## # A tibble: 12,152 x 7
##    location_name data_date  submitted_date Province District age_group        cases_reported
##    <chr>         <date>     <date>         <chr>    <chr>    <chr>                     <int>
##  1 Facility 1    2020-08-11 2020-08-12     North    Spring   malaria_rdt_0-4              11
##  2 Facility 1    2020-08-11 2020-08-12     North    Spring   malaria_rdt_5-14             12
##  3 Facility 1    2020-08-11 2020-08-12     North    Spring   malaria_rdt_15               23
##  4 Facility 1    2020-08-11 2020-08-12     North    Spring   malaria_tot                  46
##  5 Facility 2    2020-08-11 2020-08-12     North    Bolo     malaria_rdt_0-4              11
##  6 Facility 2    2020-08-11 2020-08-12     North    Bolo     malaria_rdt_5-14             10
##  7 Facility 2    2020-08-11 2020-08-12     North    Bolo     malaria_rdt_15                5
##  8 Facility 2    2020-08-11 2020-08-12     North    Bolo     malaria_tot                  26
##  9 Facility 3    2020-08-11 2020-08-12     North    Dingo    malaria_rdt_0-4               8
## 10 Facility 3    2020-08-11 2020-08-12     North    Dingo    malaria_rdt_5-14              5
## # ... with 12,142 more rows

And with that we’ve finished preparing our data! This crosses items 1, 2, and 3 off our list of things to develop for our “testing R script”. The last, and most difficult, task will be building a function to produce an epicurve based on user-defined parameters. As mentioned previously, it’s highly recommended that anyone learning shiny first look at the section on functional programming (Writing functions) to understand how this works!

When defining our function, it might be hard to think about what parameters we want to include. For functional programming with shiny, every relevant parameter will generally have a widget associated with it, so thinking about this is usually quite easy! For example in our current app, we want to be able to filter by district, and have a widget for this, so we can add a district parameter to reflect this. We don’t have any app functionality to filter by facility (for now), so we don’t need to add this as a parameter. Let’s start by making a function with three parameters:

  1. The core dataset
  2. The district of choice
  3. The age group of choice

plot_epicurve <- function(data, district = "All", agegroup = "malaria_tot") {
  
  if (!("All" %in% district)) {
    data <- data %>%
      filter(District %in% district)
    
    plot_title_district <- stringr::str_glue("{paste0(district, collapse = ', ')} districts")
    
  } else {
    
    plot_title_district <- "all districts"
    
  }
  
  # if no remaining data, return NULL
  if (nrow(data) == 0) {
    
    return(NULL)
  }
  
  data <- data %>%
    filter(age_group == agegroup)
  
  
  # if no remaining data, return NULL
  if (nrow(data) == 0) {
    
    return(NULL)
  }
  
  if (agegroup == "malaria_tot") {
      agegroup_title <- "All ages"
  } else {
    # drop the full "malaria_rdt_" prefix so the subtitle reads e.g. "0-4 years"
    agegroup_title <- stringr::str_glue("{stringr::str_remove(agegroup, 'malaria_rdt_')} years")
  }
  
  
  ggplot(data, aes(x = data_date, y = cases_reported)) +
    geom_col(width = 1, fill = "darkred") +
    theme_minimal() +
    labs(
      x = "date",
      y = "number of cases",
      title = stringr::str_glue("Malaria cases - {plot_title_district}"),
      subtitle = agegroup_title
    )
  
  
  
}

We won’t go into great detail about this function, as it’s relatively simple in how it works. One thing to note however, is we handle errors by returning NULL when it would otherwise give an error. This is because when a shiny server produces a NULL object instead of a plot object, nothing will be shown in the ui! This is important, as otherwise errors will often cause your app to stop working.
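
The guard pattern on its own can be seen in this stripped-down sketch. The function and data here are hypothetical and much simpler than plot_epicurve().

```r
library(ggplot2)

# Hypothetical toy data
toy_cases <- data.frame(
  day   = as.Date("2020-08-01") + 0:4,
  cases = c(3, 5, 2, 8, 4)
)

# Return NULL instead of attempting to plot an empty data frame
plot_or_null <- function(data) {
  if (nrow(data) == 0) {
    return(NULL)
  }
  ggplot(data, aes(x = day, y = cases)) +
    geom_col(fill = "darkred")
}

# A filter that matches nothing leaves zero rows
no_rows <- toy_cases[toy_cases$cases > 100, ]

empty_result <- plot_or_null(no_rows)     # NULL, so there is nothing to draw
full_result  <- plot_or_null(toy_cases)   # a regular ggplot object

is.null(empty_result)
```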

Another thing to note is the use of the %in% operator when evaluating the district input. As mentioned above, this could arrive as a character vector with multiple values, so using %in% is more flexible than say, ==.
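
A short base R comparison shows why this matters. The vectors are made up for illustration.

```r
# Districts in four rows of data, and a two-district selection from the widget
row_districts <- c("Bolo", "Spring", "Dingo", "Bolo")
selected      <- c("Spring", "Bolo")

# %in% asks, for each row, "is this district one of the selected ones?"
keep_in <- row_districts %in% selected
keep_in
# TRUE TRUE FALSE TRUE

# == compares position by position, recycling the shorter vector
# (Bolo vs Spring, Spring vs Bolo, Dingo vs Spring, Bolo vs Bolo)
keep_eq <- row_districts == selected
keep_eq
# FALSE FALSE FALSE TRUE
```

With ==, two rows that should be kept are silently dropped, and R gives no warning because the lengths happen to divide evenly.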

Let’s test our function!

plot_epicurve(malaria_data, district = "Bolo", agegroup = "malaria_rdt_0-4")

With our function working, we now have to understand how this all is going to fit into our shiny app. We mentioned the concept of startup code before, but let’s look at how we can actually incorporate this into the structure of our app. There are two ways we can do this!

  1. Put this code in your app.R file at the start of the script (above the UI), or
  2. Create a new file in your app’s directory called global.R, and put the startup code in this file.

It’s worth noting at this point that it’s generally easier, especially with bigger apps, to use the second option, as it lets you separate your file structure in a simple way. Let’s fully develop this global.R script now. Here is what it could look like:

# global.R script

pacman::p_load("tidyverse", "lubridate", "shiny")

# read data
malaria_data <- rio::import(here::here("data", "malaria_facility_count_data.rds")) %>% 
  as_tibble()

# clean data and pivot longer
malaria_data <- malaria_data %>%
  select(-newid) %>%
  pivot_longer(cols = starts_with("malaria_"), names_to = "age_group", values_to = "cases_reported")


# define plotting function
plot_epicurve <- function(data, district = "All", agegroup = "malaria_tot") {
  
  # create plot title
  if (!("All" %in% district)) {            
    data <- data %>%
      filter(District %in% district)
    
    plot_title_district <- stringr::str_glue("{paste0(district, collapse = ', ')} districts")
    
  } else {
    
    plot_title_district <- "all districts"
    
  }
  
  # if no remaining data, return NULL
  if (nrow(data) == 0) {
    
    return(NULL)
  }
  
  # filter to age group
  data <- data %>%
    filter(age_group == agegroup)
  
  
  # if no remaining data, return NULL
  if (nrow(data) == 0) {
    
    return(NULL)
  }
  
  if (agegroup == "malaria_tot") {
      agegroup_title <- "All ages"
  } else {
    # drop the full "malaria_rdt_" prefix so the subtitle reads e.g. "0-4 years"
    agegroup_title <- stringr::str_glue("{stringr::str_remove(agegroup, 'malaria_rdt_')} years")
  }
  
  
  ggplot(data, aes(x = data_date, y = cases_reported)) +
    geom_col(width = 1, fill = "darkred") +
    theme_minimal() +
    labs(
      x = "date",
      y = "number of cases",
      title = stringr::str_glue("Malaria cases - {plot_title_district}"),
      subtitle = agegroup_title
    )
  
  
  
}

Easy! One great feature of shiny is that it will understand what files named app.R, server.R, ui.R, and global.R are for, so there is no need to connect them to each other via any code. So just by having this code in global.R in the app directory, it will run before our app starts!

We should also note that it would improve our app’s organisation if we moved the plotting function to its own file - this will be especially helpful as apps become larger. To do this, we could make another directory called funcs, and put this function in as a file called plot_epicurve.R. We could then read this function in via the following command in global.R:

source(here::here("funcs", "plot_epicurve.R"), local = TRUE)

Note that you should always specify local = TRUE in shiny apps, since it will affect sourcing when/if the app is published on a server.

43.5 Developing an app server

Now that we have most of our code, we just have to develop our server. This is the final piece of our app, and is probably the hardest to understand. The server is a large R function, but it’s helpful to think of it as a series of smaller functions, or tasks that the app can perform. It’s important to understand that these functions are not executed in a linear order. There is an order to them, but it’s not fully necessary to understand when starting out with shiny. At a very basic level, these tasks or functions will activate when there is a change in user inputs that affects them, unless the developer has set them up so they behave differently. Again, this is all quite abstract, but let’s first go through the three basic types of shiny objects:

  1. Reactive sources - this is another term for user inputs. The shiny server has access to the outputs from the UI through the widgets we’ve programmed. Every time the values for these are changed, this is passed down to the server.

  2. Reactive conductors - these are objects that exist only inside the shiny server. We don’t actually need these for simple apps, but they produce objects that can only be seen inside the server, and used in other operations. They generally depend on reactive sources.

  3. Endpoints - these are outputs that are passed from the server to the UI. In our example, this would be the epi curve we are producing.
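
The three kinds of object can be seen together in a tiny server, sketched below. The input name n and the doubling logic are made up for illustration. testServer() from shiny (version 1.5.0 or later) lets us simulate an input change without opening the app.

```r
library(shiny)

server <- function(input, output, session) {

  # Reactive conductor: exists only inside the server,
  # and depends on the reactive source input$n
  doubled <- reactive({
    input$n * 2
  })

  # Endpoint: passed to the UI, and depends on the conductor
  output$result <- renderText({
    paste("Doubled value:", doubled())
  })
}

# Simulate a user setting the input, then inspect conductor and endpoint
testServer(server, {
  session$setInputs(n = 5)              # reactive source changes
  stopifnot(doubled() == 10)            # conductor has recalculated
  stopifnot(output$result == "Doubled value: 10")   # endpoint has updated
})
```

Nothing in the server calls doubled() or output$result explicitly when n changes; shiny works out which pieces depend on the changed input and re-runs them.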

With this in mind lets construct our server step-by-step. We’ll show our UI code again here just for reference:

ui <- fluidPage(

  titlePanel("Malaria facility visualisation app"),

  sidebarLayout(

    sidebarPanel(
         # selector for district
         selectInput(
              inputId = "select_district",
              label = "Select district",
              choices = c(
                   "All",
                   "Spring",
                   "Bolo",
                   "Dingo",
                   "Barnard"
              ),
              selected = "All",
              multiple = TRUE
         ),
         # selector for age group
         selectInput(
              inputId = "select_agegroup",
              label = "Select age group",
              choices = c(
                   "All ages" = "malaria_tot",
                   "0-4 yrs" = "malaria_rdt_0-4",
                   "5-14 yrs" = "malaria_rdt_5-14",
                   "15+ yrs" = "malaria_rdt_15"
              ), 
              selected = "malaria_tot",
              multiple = FALSE
         )

    ),

    mainPanel(
      # epicurve goes here
      plotOutput("malaria_epicurve")
    )
    
  )
)

From this UI code we have:

  • Two inputs:
    • District selector (with an inputId of select_district)
    • Age group selector (with an inputId of select_agegroup)
  • One output:
    • The epicurve (with an outputId of malaria_epicurve)

As stated previously, these unique names we have assigned to our inputs and outputs are crucial. They must be unique and are used to pass information between the ui and server. In our server, we access our inputs via the syntax input$inputId, and outputs are passed to the ui through the syntax output$outputId. Let’s have a look at an example, because again this is hard to understand otherwise!

server <- function(input, output, session) {
  
  output$malaria_epicurve <- renderPlot(
    plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup)
  )
  
}

The server for a simple app like this is actually quite straightforward! You’ll notice that the server is a function with three parameters - input, output, and session - this isn’t that important to understand for now, but it’s important to stick to this setup! In our server we only have one task - this renders a plot based on the function we made earlier, and the inputs from the ui. Notice how the names of the input and output objects correspond exactly to those in the ui.

To understand the basics of how the server reacts to user inputs, you should note that the output will know (through the underlying package) when inputs change, and rerun this function to create a plot every time they change. Note that we also use the renderPlot() function here - this is one of a family of class-specific functions that pass those objects to a ui output. There are a number of functions that behave similarly, but you need to ensure the function used matches the class of object you’re passing to the ui! For example:

  • renderText() - send text to the ui
  • renderDataTable() - send an interactive table to the ui.

Remember that these also need to match the output function used in the ui - so renderPlot() is paired with plotOutput(), and renderText() is matched with textOutput().
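As a minimal sketch of this pairing (the outputIds district_text and district_plot are hypothetical, and the random bar plot is only a placeholder standing in for plot_epicurve()), each render function in the server is matched with its output function in the ui through a shared outputId:

```r
library(shiny)

# Hypothetical two-output app: each render*() call in the server is paired
# with the matching *Output() call in the ui through a shared outputId.
ui <- fluidPage(
  selectInput(
    inputId = "select_district",
    label = "Select district",
    choices = c("All", "Spring", "Bolo", "Dingo", "Barnard")
  ),
  textOutput("district_text"),   # paired with renderText()
  plotOutput("district_plot")    # paired with renderPlot()
)

server <- function(input, output, session) {

  # text object -> renderText() -> textOutput()
  output$district_text <- renderText(
    paste("Selected district:", input$select_district)
  )

  # plot object -> renderPlot() -> plotOutput()
  output$district_plot <- renderPlot(
    # placeholder plot of random values, standing in for plot_epicurve()
    barplot(rpois(10, lambda = 5), main = input$select_district)
  )
}

# shinyApp(ui, server)  # uncomment to launch the app interactively
```

Swapping the pairs (for example sending renderText() to a plotOutput()) would not display as intended, which is why the class of the object, the render function, and the output function all need to agree.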

So we’ve finally made a functioning app! We can run this by pressing the Run App button on the top right of the script window in RStudio. You should note that you can choose to run your app in your default browser (rather than RStudio) which will more accurately reflect what the app will look like for other users.

It is fun to note that in the R console, the app is “listening”! Talk about reactivity!

43.6 Adding more functionality

At this point we’ve finally got a running app, but we have very little functionality. We also haven’t really scratched the surface of what shiny can do, so there’s a lot more to learn about! Let’s continue to build our existing app by adding some extra features. Some things that could be nice to add could be:

  1. Some explanatory text
  2. A download button for our plot - this would provide the user with a high quality version of the image that they’re generating in the app
  3. A selector for specific facilities
  4. Another dashboard page - this could show a table of our data.

This is a lot to add, but we can use it to learn about a bunch of different shiny features on the way. There is so much to learn about shiny; it can get very advanced, but hopefully once users have a better idea of how to use it, they can become more comfortable using external learning sources as well.

Adding static text

Let’s first discuss adding static text to our shiny app. Adding text to our app is extremely easy, once you have a basic grasp of it. Since static text doesn’t change in the shiny app (if you’d like it to change, you can use text rendering functions in the server!), all of shiny’s static text is generally added in the ui of the app. We won’t go through this in great detail, but you can add a number of different elements to your ui (and even custom ones) by interfacing R with HTML and css.

HTML and css are languages that are explicitly involved in user interface design. We don’t need to understand these too well, but HTML creates objects in UI (like a text box, or a table), and css is generally used to change the style and aesthetics of those objects. Shiny has access to a large array of HTML tags - these are present for objects that behave in a specific way, such as headers, paragraphs of text, line breaks, tables, etc. We can use some of these examples like this:

  • h1() - this is a header tag, which will make enclosed text automatically larger, and change defaults as they pertain to the font face, colour etc (depending on the overall theme of your app). You can access smaller and smaller sub-headings with h2() down to h6() as well. Usage looks like:

    • h1("my header - section 1")
  • p() - this is a paragraph tag, which will make enclosed text similar to text in a body of text. This text will automatically wrap, and be of a relatively small size (footers could be smaller for example.) Think of it as the text body of a word document. Usage looks like:

    • p("This is a larger body of text where I am explaining the function of my app")
  • tags$b() and tags$i() - these are used to create bold (tags$b()) and italicised (tags$i()) text from whichever text is enclosed!

  • tags$ul(), tags$ol() and tags$li() - these are tags used in creating lists. These are all used within the syntax below, and allow the user to create either an ordered list (tags$ol(); i.e. numbered) or unordered list (tags$ul(), i.e. bullet points). tags$li() is used to denote items in the list, regardless of which type of list is used. e.g.:

tags$ol(
  
  tags$li("Item 1"),
  
  tags$li("Item 2"),
  
  tags$li("Item 3")
  
)
  • br() and hr() - these tags create linebreaks and horizontal lines (with a linebreak) respectively. Use them to separate out the sections of your app and text! There is no need to pass any items to these tags (parentheses can remain empty).

  • div() - this is a generic tag that can contain anything, and can be named anything. Once you progress with ui design, you can use these to compartmentalize your ui, give specific sections specific styles, and create interactions between the server and UI elements. We won’t go into these in detail, but they’re worth being aware of!

Note that every one of these objects can be accessed through tags$... or for some, just the function. These are effectively synonymous, but it may help to use the tags$... style if you’d rather be more explicit and not overwrite the functions accidentally. This is also by no means an exhaustive list of tags available. There is a full list of all tags available in shiny here and even more can be used by inserting HTML directly into your ui!

If you’re feeling confident, you can also add any css styling elements to your HTML tags with the style argument in any of them. We won’t go into how this works in detail, but one tip for testing aesthetic changes to a UI is using the HTML inspector mode in chrome (of your shiny app you are running in browser), and editing the style of objects yourself!
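A small sketch of the two points above: the short form and the tags$... form of a tag build the same HTML, and the style argument attaches inline css. The colour and font size used here are arbitrary example values, not recommendations.

```r
library(shiny)

# h1() and tags$h1() build the same HTML element
header_short    <- h1("my header - section 1")
header_explicit <- tags$h1("my header - section 1")
same_html <- identical(as.character(header_short), as.character(header_explicit))

# Inline css through the style argument (arbitrary example values)
styled_text <- p(
  "This is a larger body of text where I am explaining the function of my app",
  style = "color: darkred; font-size: 16px;"
)

# Printing a tag in the console shows the HTML it generates
print(styled_text)
```

Printing tags in the console like this is a quick way to check what a piece of ui will produce before running the whole app.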

Let’s add some text to our app:

ui <- fluidPage(

  titlePanel("Malaria facility visualisation app"),

  sidebarLayout(

    sidebarPanel(
         h4("Options"),
         # selector for district
         selectInput(
              inputId = "select_district",
              label = "Select district",
              choices = c(
                   "All",
                   "Spring",
                   "Bolo",
                   "Dingo",
                   "Barnard"
              ),
              selected = "All",
              multiple = TRUE
         ),
         # selector for age group
         selectInput(
              inputId = "select_agegroup",
              label = "Select age group",
              choices = c(
                   "All ages" = "malaria_tot",
                   "0-4 yrs" = "malaria_rdt_0-4",
                   "5-14 yrs" = "malaria_rdt_5-14",
                   "15+ yrs" = "malaria_rdt_15"
              ), 
              selected = "All",
              multiple = FALSE
         ),
    ),

    mainPanel(
      # epicurve goes here
      plotOutput("malaria_epicurve"),
      br(),
      hr(),
      p("Welcome to the malaria facility visualisation app! To use this app, manipulate the widgets on the side to change the epidemic curve according to your preferences! To download a high quality image of the plot you've created, you can also download it with the download button. To see the raw data, use the raw data tab for an interactive form of the table. The data dictionary is as follows:"),
    tags$ul(
      tags$li(tags$b("location_name"), " - the facility that the data were collected at"),
      tags$li(tags$b("data_date"), " - the date the data were collected at"),
      tags$li(tags$b("submitted_date"), " - the date the data were submitted at"),
      tags$li(tags$b("Province"), " - the province the data were collected at (all 'North' for this dataset)"),
      tags$li(tags$b("District"), " - the district the data were collected at"),
      tags$li(tags$b("age_group"), " - the age group the data were collected for (0-4, 5-14, 15+, and all ages)"),
      tags$li(tags$b("cases_reported"), " - the number of cases reported for the facility/age group on the given date")
    )
    
  )
)
)

Adding a download button

Let’s move on to the second of the four features. A download button is a fairly common thing to add to an app and is fairly easy to make. We need to add another widget to our ui, and we need to add another output to our server to attach to it. We can also introduce reactive conductors in this example!

Let’s update our ui first - this is easy as shiny comes with a widget called downloadButton() - let’s give it an outputId and a label.

ui <- fluidPage(

  titlePanel("Malaria facility visualisation app"),

  sidebarLayout(

    sidebarPanel(
         # selector for district
         selectInput(
              inputId = "select_district",
              label = "Select district",
              choices = c(
                   "All",
                   "Spring",
                   "Bolo",
                   "Dingo",
                   "Barnard"
              ),
              selected = "All",
              multiple = FALSE
         ),
         # selector for age group
         selectInput(
              inputId = "select_agegroup",
              label = "Select age group",
              choices = c(
                   "All ages" = "malaria_tot",
                   "0-4 yrs" = "malaria_rdt_0-4",
                   "5-14 yrs" = "malaria_rdt_5-14",
                   "15+ yrs" = "malaria_rdt_15"
              ), 
              selected = "All",
              multiple = FALSE
         ),
         # horizontal line
         hr(),
         downloadButton(
           outputId = "download_epicurve",
           label = "Download plot"
         )

    ),

    mainPanel(
      # epicurve goes here
      plotOutput("malaria_epicurve"),
      br(),
      hr(),
      p("Welcome to the malaria facility visualisation app! To use this app, manipulate the widgets on the side to change the epidemic curve according to your preferences! To download a high quality image of the plot you've created, you can also download it with the download button. To see the raw data, use the raw data tab for an interactive form of the table. The data dictionary is as follows:"),
      tags$ul(
        tags$li(tags$b("location_name"), " - the facility that the data were collected at"),
        tags$li(tags$b("data_date"), " - the date the data were collected at"),
        tags$li(tags$b("submitted_date"), " - the date the data were submitted at"),
        tags$li(tags$b("Province"), " - the province the data were collected at (all 'North' for this dataset)"),
        tags$li(tags$b("District"), " - the district the data were collected at"),
        tags$li(tags$b("age_group"), " - the age group the data were collected for (0-4, 5-14, 15+, and all ages)"),
        tags$li(tags$b("cases_reported"), " - the number of cases reported for the facility/age group on the given date")
      )
      
    )
    
  )
)

Note that we’ve also added in a hr() tag - this adds a horizontal line separating our control widgets from our download widgets. This is another one of the HTML tags that we discussed previously.

Now that we have our ui ready, we need to add the server component. Downloads are done in the server with the downloadHandler() function. Similar to our plot, we need to attach it to an output that has the same outputId as the download button. This function takes two arguments - filename and content - these are both functions. As you might be able to guess, filename is used to specify the name of the downloaded file, and content is used to specify what should be downloaded. content contains a function that you would use to save data locally - so if you were downloading a csv file you could use rio::export(). Since we’re downloading a plot, we’ll use ggplot2::ggsave(). Let’s look at how we would program this (we won’t add it to the server yet).

server <- function(input, output, session) {
  
  output$malaria_epicurve <- renderPlot(
    plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup)
  )
  
  output$download_epicurve <- downloadHandler(
    filename = function() {
      stringr::str_glue("malaria_epicurve_{input$select_district}.png")
    },
    
    content = function(file) {
      ggsave(file, 
             plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup),
             width = 8, height = 5, dpi = 300)
    }
    
  )
  
}

Note that the content function always takes a file argument, which we put where the output file name is specified. You might also notice that we’re repeating code here - we are using our plot_epicurve() function twice in this server, once for the download and once for the image displayed in the app. While this won’t massively affect performance, it means that the code to generate this plot will have to be run when the user changes the widgets specifying the district and age group, and again when the user wants to download the plot. In larger apps, suboptimal decisions like this one will slow things down more and more, so it’s good to learn how to make our app more efficient in this sense. What would make more sense is if we had a way to run the epicurve code when the districts/age groups are changed, and let that be used by the renderPlot() and downloadHandler() functions. This is where reactive conductors come in!

Reactive conductors are objects that are created in the shiny server in a reactive way, but are not outputted - they can just be used by other parts of the server. There are a number of different kinds of reactive conductors, but we’ll go through the basic two.

1. reactive() - this is the most basic reactive conductor - it will react whenever any inputs used inside of it change (so our district/age group widgets)
2. eventReactive() - this reactive conductor works the same as reactive(), except that the user can specify which inputs cause it to rerun. This is useful if your reactive conductor takes a long time to process, but this will be explained more later.

Let’s look at the two examples:

malaria_plot_r <- reactive({
  
  plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup)
  
})


# only runs when the district selector changes!
malaria_plot_er <- eventReactive(input$select_district, {
  
  plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup)
  
})

When we use the eventReactive() setup, we can specify which inputs cause this chunk of code to run - this isn’t very useful to us at the moment, so we can leave it for now. Note that you can include multiple inputs with c().

Let’s look at how we can integrate this into our server code:
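As a hedged sketch of triggering on several inputs with c(): the helper describe_selection() and the output selection_text below are hypothetical stand-ins (returning text rather than a plot) so that the example runs without the malaria data.

```r
library(shiny)

# Hypothetical stand-in for plot_epicurve(): returns a label rather than a plot,
# so the sketch runs without the malaria data.
describe_selection <- function(district, agegroup) {
  paste0(district, " / ", agegroup)
}

server <- function(input, output, session) {

  # Reruns when either the district or the age group selector changes,
  # because both inputs are listed in the first argument via c()
  selection_er <- eventReactive(c(input$select_district, input$select_agegroup), {
    describe_selection(input$select_district, input$select_agegroup)
  })

  # selection_text is a hypothetical outputId used only for this sketch
  output$selection_text <- renderText(selection_er())
}
```

Any input that is read inside the braces but not listed in the first argument would not cause the code to rerun on its own.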

server <- function(input, output, session) {
  
  malaria_plot <- reactive({
    plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup)
  })
  
  
  
  output$malaria_epicurve <- renderPlot(
    malaria_plot()
  )
  
  output$download_epicurve <- downloadHandler(
    
    filename = function() {
      stringr::str_glue("malaria_epicurve_{input$select_district}.png")
    },
    
    content = function(file) {
      ggsave(file, 
             malaria_plot(),
             width = 8, height = 5, dpi = 300)
    }
    
  )
  
}

You can see we’re just calling on the output of the reactive we’ve defined in both our download and plot rendering functions. One thing to note that often trips people up is that you have to use the outputs of reactives as if they were functions - so you must add empty brackets at the end of them (i.e. malaria_plot() is correct, and malaria_plot is not). Now that we’ve added this solution our app is a little tidier, faster, and easier to change since all our code that runs the epicurve function is in one place.
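The same downloadHandler() pattern also covers data downloads, as mentioned earlier for csv files. The sketch below is an illustration only: example_data is a made-up stand-in for malaria_data, download_data is a hypothetical outputId, and base R’s write.csv() is used in place of rio::export() so that the example has no extra dependencies.

```r
library(shiny)

# Made-up stand-in for malaria_data so the sketch is self-contained
example_data <- data.frame(
  location_name  = c("Facility 1", "Facility 10"),
  District       = c("Spring", "Bolo"),
  cases_reported = c(12, 7),
  stringsAsFactors = FALSE
)

# The content function receives the temporary file path chosen by shiny
write_data_csv <- function(file) {
  # base R writer; rio::export(example_data, file) is the alternative named in the text
  write.csv(example_data, file, row.names = FALSE)
}

server <- function(input, output, session) {

  # "download_data" is a hypothetical outputId; it would need a matching
  # downloadButton(outputId = "download_data", label = "Download data") in the ui
  output$download_data <- downloadHandler(
    filename = function() {
      "malaria_data.csv"
    },
    content = write_data_csv
  )
}
```

Keeping the writing step in its own small function makes it possible to check it outside the app, for example by calling write_data_csv(tempfile(fileext = ".csv")).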

Adding a facility selector

Let’s move on to our next feature - a selector for specific facilities. We’ll add another parameter to our function so we can pass the selected facility as an argument from our code. Let’s look at doing this first - it operates on the same principles as the other parameters we’ve set up. Let’s update and test our function.

plot_epicurve <- function(data, district = "All", agegroup = "malaria_tot", facility = "All") {
  
  if (!("All" %in% district)) {
    data <- data %>%
      filter(District %in% district)
    
    plot_title_district <- stringr::str_glue("{paste0(district, collapse = ', ')} districts")
    
  } else {
    
    plot_title_district <- "all districts"
    
  }
  
  # if no remaining data, return NULL
  if (nrow(data) == 0) {
    
    return(NULL)
  }
  
  data <- data %>%
    filter(age_group == agegroup)
  
  
  # if no remaining data, return NULL
  if (nrow(data) == 0) {
    
    return(NULL)
  }
  
  if (agegroup == "malaria_tot") {
      agegroup_title <- "All ages"
  } else {
    # strip the full "malaria_rdt_" prefix so the subtitle has no leading underscore
    agegroup_title <- stringr::str_glue("{stringr::str_remove(agegroup, 'malaria_rdt_')} years")
  }
  
  if (!("All" %in% facility)) {
    data <- data %>%
      filter(location_name == facility)
    
    plot_title_facility <- facility
    
  } else {
    
    plot_title_facility <- "all facilities"
    
  }
  
  # if no remaining data, return NULL
  if (nrow(data) == 0) {
    
    return(NULL)
  }

  
  
  ggplot(data, aes(x = data_date, y = cases_reported)) +
    geom_col(width = 1, fill = "darkred") +
    theme_minimal() +
    labs(
      x = "date",
      y = "number of cases",
      title = stringr::str_glue("Malaria cases - {plot_title_district}; {plot_title_facility}"),
      subtitle = agegroup_title
    )
  
  
  
}

Let’s test it:

plot_epicurve(malaria_data, district = "Spring", agegroup = "malaria_rdt_0-4", facility = "Facility 1")

With all the facilities in our data, it isn’t very clear which facilities correspond to which districts - and the end user won’t know either. This might make using the app quite unintuitive. For this reason, we should make the facility options in the UI change dynamically as the user changes the district - so one filters the other! Since we have so many variables that we’re using in the options, we might also want to generate some of our options for the ui in our global.R file from the data. For example, we can add this code chunk to global.R after we’ve read our data in:

all_districts <- c("All", unique(malaria_data$District))

# data frame of location names by district
facility_list <- malaria_data %>%
  group_by(location_name, District) %>%
  summarise() %>% 
  ungroup()

Let’s look at them:

all_districts
## [1] "All"     "Spring"  "Bolo"    "Dingo"   "Barnard"
facility_list
## # A tibble: 65 x 2
##    location_name District
##    <chr>         <chr>   
##  1 Facility 1    Spring  
##  2 Facility 10   Bolo    
##  3 Facility 11   Spring  
##  4 Facility 12   Dingo   
##  5 Facility 13   Bolo    
##  6 Facility 14   Dingo   
##  7 Facility 15   Barnard 
##  8 Facility 16   Barnard 
##  9 Facility 17   Barnard 
## 10 Facility 18   Bolo    
## # ... with 55 more rows

We can pass these new variables to the ui without any issue, since they are globally visible to both the server and the ui! Let’s update our UI:

ui <- fluidPage(

  titlePanel("Malaria facility visualisation app"),

  sidebarLayout(

    sidebarPanel(
         # selector for district
         selectInput(
              inputId = "select_district",
              label = "Select district",
              choices = all_districts,
              selected = "All",
              multiple = FALSE
         ),
         # selector for age group
         selectInput(
              inputId = "select_agegroup",
              label = "Select age group",
              choices = c(
                   "All ages" = "malaria_tot",
                   "0-4 yrs" = "malaria_rdt_0-4",
                   "5-14 yrs" = "malaria_rdt_5-14",
                   "15+ yrs" = "malaria_rdt_15"
              ), 
              selected = "All",
              multiple = FALSE
         ),
         # selector for facility
         selectInput(
           inputId = "select_facility",
           label = "Select Facility",
           choices = c("All", facility_list$location_name),
           selected = "All"
         ),
         
         # horizontal line
         hr(),
         downloadButton(
           outputId = "download_epicurve",
           label = "Download plot"
         )

    ),

    mainPanel(
      # epicurve goes here
      plotOutput("malaria_epicurve"),
      br(),
      hr(),
      p("Welcome to the malaria facility visualisation app! To use this app, manipulate the widgets on the side to change the epidemic curve according to your preferences! To download a high quality image of the plot you've created, you can also download it with the download button. To see the raw data, use the raw data tab for an interactive form of the table. The data dictionary is as follows:"),
      tags$ul(
        tags$li(tags$b("location_name"), " - the facility that the data were collected at"),
        tags$li(tags$b("data_date"), " - the date the data were collected at"),
        tags$li(tags$b("submitted_date"), " - the date the data were submitted at"),
        tags$li(tags$b("Province"), " - the province the data were collected at (all 'North' for this dataset)"),
        tags$li(tags$b("District"), " - the district the data were collected at"),
        tags$li(tags$b("age_group"), " - the age group the data were collected for (0-4, 5-14, 15+, and all ages)"),
        tags$li(tags$b("cases_reported"), " - the number of cases reported for the facility/age group on the given date")
      )
      
    )
    
  )
)

Notice how we’re now passing variables for our choices instead of hard coding them in the ui! This might make our code more compact as well! Lastly, we’ll have to update the server. It will be easy to update our function to incorporate our new input (we just have to pass it as an argument to our new parameter), but we should remember we also want the ui to update dynamically when the user changes the selected district. It is important to understand here that we can change the parameters and behaviour of widgets while the app is running, but this needs to be done in the server. To do this, we need to learn a new way for the server to send changes to the ui.

The functions we need for this are known as observer functions, and they are similar to reactive functions in how they behave. They have one key difference though:

  • Reactive functions do not directly affect outputs, and produce objects that can be seen in other locations in the server
  • Observer functions can affect server outputs, but do so via side effects of other functions. (They can also do other things, but this is their main function in practice)

Similar to reactive functions, there are two flavours of observer functions, and they are divided by the same logic that divides reactive functions:

  1. observe() - this function runs whenever any inputs used inside of it change
  2. observeEvent() - this function runs when a user-specified input changes

We also need to understand the shiny-provided functions that update widgets. These are fairly straightforward to run - they first take the session object from the server function (this doesn’t need to be understood for now), and then the inputId of the widget to be changed. We then pass new versions of any of the parameters that are already taken by selectInput() - these will be automatically updated in the widget.

Let’s look at an isolated example of how we could use this in our server. When the user changes the district, we want to filter our tibble of facilities by district, and update the choices to only reflect those that are available in that district (and an option for all facilities).

observe({
  
  if (input$select_district == "All") {
    new_choices <- facility_list$location_name
  } else {
    new_choices <- facility_list %>%
      filter(District == input$select_district) %>%
      pull(location_name)
  }
  
  new_choices <- c("All", new_choices)
  
  updateSelectInput(session, inputId = "select_facility",
                    choices = new_choices)
  
})

And that’s it! We can add it into our server, and that behaviour will now work. Here’s what our new server should look like:

server <- function(input, output, session) {
  
  malaria_plot <- reactive({
    plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup, facility = input$select_facility)
  })
  
  
  
  observe({
    
    if (input$select_district == "All") {
      new_choices <- facility_list$location_name
    } else {
      new_choices <- facility_list %>%
        filter(District == input$select_district) %>%
        pull(location_name)
    }
    
    new_choices <- c("All", new_choices)
    
    updateSelectInput(session, inputId = "select_facility",
                      choices = new_choices)
    
  })
  
  
  output$malaria_epicurve <- renderPlot(
    malaria_plot()
  )
  
  output$download_epicurve <- downloadHandler(
    
    filename = function() {
      stringr::str_glue("malaria_epicurve_{input$select_district}.png")
    },
    
    content = function(file) {
      ggsave(file, 
             malaria_plot(),
             width = 8, height = 5, dpi = 300)
    }
    
  )
  
  
  
}
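Since the facility choices only need to change when the district selector changes, the same behaviour could also be sketched with observeEvent(), the second flavour of observer listed above. This is an illustrative variant rather than the version used in the rest of this page: the small facility_list below is a made-up miniature of the real one, and facility_choices() is a hypothetical helper name.

```r
library(shiny)

# Made-up miniature of facility_list so the sketch is self-contained
facility_list <- data.frame(
  location_name = c("Facility 1", "Facility 10", "Facility 11"),
  District      = c("Spring", "Bolo", "Spring"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: facility choices for one district, always with "All" first
facility_choices <- function(facilities, district) {
  if (district != "All") {
    facilities <- facilities[facilities$District == district, ]
  }
  c("All", facilities$location_name)
}

server <- function(input, output, session) {

  # Runs only when the district selector changes
  observeEvent(input$select_district, {
    updateSelectInput(
      session,
      inputId = "select_facility",
      choices = facility_choices(facility_list, input$select_district)
    )
  })
}
```

Pulling the filtering into a plain function keeps the observer short and lets the logic be checked in the console, for example with facility_choices(facility_list, "Spring").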

Adding another tab with a table

Now we’ll move on to the last component we want to add to our app. We’ll want to separate our ui into two tabs, one of which will have an interactive table where the user can see the data they are making the epidemic curve with. To do this, we can use the packaged ui elements that come with shiny relevant to tabs. On a basic level, we can enclose most of our main panel in this general structure:

# ... the rest of ui

mainPanel(
  
  tabsetPanel(
    type = "tabs",
    tabPanel(
      "Epidemic Curves",
      ...
    ),
    tabPanel(
      "Data",
      ...
    )
  )
)

Let’s apply this to our ui. We will also want to use the DT package here - this is a great package for making interactive tables from pre-existing data. We can see it being used for DT::dataTableOutput() in this example.

ui <- fluidPage(
     
     titlePanel("Malaria facility visualisation app"),
     
     sidebarLayout(
          
          sidebarPanel(
               # selector for district
               selectInput(
                    inputId = "select_district",
                    label = "Select district",
                    choices = all_districts,
                    selected = "All",
                    multiple = FALSE
               ),
               # selector for age group
               selectInput(
                    inputId = "select_agegroup",
                    label = "Select age group",
                    choices = c(
                         "All ages" = "malaria_tot",
                         "0-4 yrs" = "malaria_rdt_0-4",
                         "5-14 yrs" = "malaria_rdt_5-14",
                         "15+ yrs" = "malaria_rdt_15"
                    ), 
                    selected = "All",
                    multiple = FALSE
               ),
               # selector for facility
               selectInput(
                    inputId = "select_facility",
                    label = "Select Facility",
                    choices = c("All", facility_list$location_name),
                    selected = "All"
               ),
               
               # horizontal line
               hr(),
               downloadButton(
                    outputId = "download_epicurve",
                    label = "Download plot"
               )
               
          ),
          
          mainPanel(
               tabsetPanel(
                    type = "tabs",
                    tabPanel(
                         "Epidemic Curves",
                         plotOutput("malaria_epicurve")
                    ),
                    tabPanel(
                         "Data",
                         DT::dataTableOutput("raw_data")
                    )
               ),
               br(),
               hr(),
               p("Welcome to the malaria facility visualisation app! To use this app, manipulate the widgets on the side to change the epidemic curve according to your preferences! To download a high quality image of the plot you've created, you can also download it with the download button. To see the raw data, use the raw data tab for an interactive form of the table. The data dictionary is as follows:"),
               tags$ul(
                    tags$li(tags$b("location_name"), " - the facility that the data were collected at"),
                    tags$li(tags$b("data_date"), " - the date the data were collected at"),
                    tags$li(tags$b("submitted_date"), " - the date the data were submitted at"),
                    tags$li(tags$b("Province"), " - the province the data were collected at (all 'North' for this dataset)"),
                    tags$li(tags$b("District"), " - the district the data were collected at"),
                    tags$li(tags$b("age_group"), " - the age group the data were collected for (0-4, 5-14, 15+, and all ages)"),
                    tags$li(tags$b("cases_reported"), " - the number of cases reported for the facility/age group on the given date")
               )
               
               
          )
     )
)

Now our app is arranged into tabs! Let’s make the necessary edits to the server as well. Since we don’t need to manipulate our dataset at all before we render it, this is actually very simple - we just render the malaria_data dataset via DT::renderDT() to the ui!

server <- function(input, output, session) {
  
  malaria_plot <- reactive({
    plot_epicurve(malaria_data, district = input$select_district, agegroup = input$select_agegroup, facility = input$select_facility)
  })
  
  
  
  observe({
    
    if (input$select_district == "All") {
      new_choices <- facility_list$location_name
    } else {
      new_choices <- facility_list %>%
        filter(District == input$select_district) %>%
        pull(location_name)
    }
    
    new_choices <- c("All", new_choices)
    
    updateSelectInput(session, inputId = "select_facility",
                      choices = new_choices)
    
  })
  
  
  output$malaria_epicurve <- renderPlot(
    malaria_plot()
  )
  
  output$download_epicurve <- downloadHandler(
    
    filename = function() {
      stringr::str_glue("malaria_epicurve_{input$select_district}.png")
    },
    
    content = function(file) {
      ggsave(file, 
             malaria_plot(),
             width = 8, height = 5, dpi = 300)
    }
    
  )
  
  # render data table to ui
  output$raw_data <- DT::renderDT(
    malaria_data
  )
  
  
}

43.7 Sharing shiny apps

Now that you’ve developed your app, you probably want to share it with others - this is the main advantage of shiny after all! We can do this by sharing the code directly, or we could publish on a server. If we share the code, others will be able to see what you’ve done and build on it, but this will negate one of the main advantages of shiny - it can eliminate the need for end-users to maintain an R installation. For this reason, if you’re sharing your app with users who are not comfortable with R, it is much easier to share an app that has been published on a server.

If you’d rather share the code, you could make a .zip file of the app, or better yet, publish your app on github and add collaborators. You can refer to the section on github for further information here.

However, if we’re publishing the app online, we need to do a little more work. Ultimately, we want your app to be accessible via a web URL so others can get quick and easy access to it. Unfortunately, to publish your app on a server, you need to have access to a server to publish it on! There are a number of hosting options when it comes to this:

  • shinyapps.io: this is the easiest place to publish shiny apps, as it has the smallest amount of configuration work needed, and has some free, but limited licenses.

  • RStudio Connect: this is a far more powerful version of an R server, that can perform many operations, including publishing shiny apps. It is however, harder to use, and less recommended for first-time users.

For the purposes of this document, we will use shinyapps.io, since it is easier for first time users. You can make a free account here to start - there are also different price plans for server licesnses if needed. The more users you expect to have, the more expensive your price plan may have to be, so keep this under consideration. If you’re looking to create something for a small set of individuals to use, a free license may be perfectly suitable, but a public facing app may need more licenses.

First, make sure your app is suitable for publishing on a server. Restart your R session and ensure that the app runs without running any extra code. This is important, as an app that requires package loading or data reading not defined in your app code won’t run on a server. Also note that you can’t have any explicit (absolute) file paths in your app - these will be invalid in the server setting - using the here package solves this issue very well. Finally, if you’re reading data from a source that requires user authentication, such as your organisation’s servers, this will not generally work on a server. You will need to liaise with your IT department to figure out how to whitelist the shiny server.

Signing up for an account

Once you have your account, you can navigate to the tokens page under Accounts. Here you will want to add a new token - this will be used to deploy your app.

From here, you should note that the url of your account will reflect the name of your app - so if your app is called my_app, the url will be appended as xxx.io/my_app/. Choose your app name wisely! Now that you are all ready, click deploy - if successful this will run your app on the web url you chose!

43.8 Further reading

So far, we’ve covered a lot of aspects of shiny, yet have barely scratched the surface of what it offers. While this guide serves as an introduction, there is a lot more to learn to fully understand shiny. You should start making apps and gradually add more and more functionality.

Miscellaneous

44 Writing functions

44.1 Preparation

Load packages

This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library() from base R. See the page on R basics for more information on R packages.

Import data

We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow step-by-step, see instructions in the Download handbook and data page. The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.

In the last part of this page we will also use some data on H7N9 flu from 2013.

44.2 Functions

Functions are helpful in programming because they make code easier to understand, shorter, and less prone to errors (provided there are no errors in the function itself).

If you have come this far in the handbook, you have already come across countless functions, since in R every operation is a function call: +, for, if, [, $, { and so on. For example, x + y is the same as `+`(x, y).
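To make this concrete, here is a minimal sketch (base R only, with made-up values) showing that operators can be called like any other function and can also be passed to other functions.

```r
# Operators are ordinary functions: wrap the name in backticks to call it directly
x <- 2
y <- 3
sum_infix  <- x + y
sum_prefix <- `+`(x, y)          # same call, written in prefix form

# Subsetting is a function call too
ages <- c(12, 35, 61)
second_age <- `[`(ages, 2)       # equivalent to ages[2]

# Because they are functions, operators can be handed to other functions
ages_plus_one <- sapply(ages, `+`, 1)

identical(sum_infix, sum_prefix) # TRUE
```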

R is one of the languages that offers the most possibilities for working with functions and gives the user enough tools to write them easily. We should not think of functions as fixed at the top or at the end of the programming chain: R lets us treat them like any other object, such as a vector, and even use them inside other functions, lists, and so on.

Many very advanced resources on functional programming exist; here we only give an introduction, with short practical examples, to help you get started. You are then encouraged to visit the links in the references to read more about it.

44.3 Why would you use a function?

Before answering this question, note that you have already seen tips for writing your very first R functions in the page on Iteration, loops, and lists of this handbook. In fact, “if/else” statements and loops are often a core part of our functions, since they help either to broaden the application of our code by allowing multiple conditions or to iterate code for repeated tasks. To decide whether a function is worthwhile, ask yourself:

  • Am I repeating the same block of code multiple times to apply it to a different variable or dataset?

  • Would getting rid of the repetition substantially shorten my overall code and make it run quicker?

  • Is it possible that the code I have written will be used again, but with a different value, in many places in my code?

If the answer to any of these questions is “YES”, then you probably need to write a function.

44.4 How does R build functions?

Functions in R have three main components:

  • the formals(), which is the list of arguments that controls how we can call the function

  • the body(), which is the code inside the function, i.e. within the curly brackets or following the parentheses, depending on how we write it

and,

  • the environment(), which helps locate the function’s variables and determines how the function finds values.

Once you have created your function, you can inspect each of these components by calling the corresponding function (formals(), body(), or environment()) on it.

44.5 Basic syntax and structure

  • A function needs a name that makes its job easy to understand as soon as we read it. This is already the case for most of base R: functions like mean(), print(), and summary() have very straightforward names.

  • A function needs arguments, such as the data to work on and other objects, which can be static values among other options.

  • Finally, a function gives an output based on its core task and the arguments it has been given. Usually we use built-in functions such as print() or return() to produce the output. The output can be a logical value, a number, a character string, a data frame... in short, any kind of R object.

Basically this is the composition of a function:

function_name <- function(argument_1, argument_2, argument_3){

  function_task

  return(output)
}

We can create our first function that will be called contain_covid19().

contain_covid19 <- function(barrier_gest, wear_mask, get_vaccine){

  if(barrier_gest == "yes" & wear_mask == "yes" & get_vaccine == "yes")

    return("success")

  else("please make sure all are yes, this pandemic has to end!")
}

We can then verify the components of our newly created function.

formals(contain_covid19)
## $barrier_gest
## 
## 
## $wear_mask
## 
## 
## $get_vaccine
body(contain_covid19)
## {
##     if (barrier_gest == "yes" & wear_mask == "yes" & get_vaccine == 
##         "yes") 
##         return("success")
##     else ("please make sure all are yes, this pandemic has to end!")
## }
environment(contain_covid19)
## <environment: R_GlobalEnv>

Now we will test our function. To call it, use it as you would any R function, i.e. by writing the function name and adding the required arguments.

contain_covid19(barrier_gest = "yes", wear_mask = "yes", get_vaccine = "yes")
## [1] "success"

Above, we wrote the name of each argument as a precaution. The code also works without them, because R matches unnamed values to arguments by position. So as long as you supply the values in the correct order, you can skip the argument names when calling the function.

contain_covid19("yes", "yes", "yes")
## [1] "success"
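As a small illustration of argument matching, the sketch below uses a hypothetical helper, describe_case(), which is not part of the handbook's data or code. Named arguments can be given in any order, whereas unnamed arguments are matched by position.

```r
# Hypothetical helper, for illustration only
describe_case <- function(id, outcome) {
  paste("Case", id, "had outcome:", outcome)
}

by_name_reordered <- describe_case(outcome = "Recover", id = "A01")  # names given, so order is free
by_position       <- describe_case("A01", "Recover")                 # no names, so order matters
swapped           <- describe_case("Recover", "A01")                 # runs, but values land in the wrong arguments

by_name_reordered == by_position   # TRUE
by_position == swapped             # FALSE
```

The swapped call does not raise an error, which is one reason naming arguments is a sensible precaution.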

Now let’s look at what happens if one of the values is "no" or anything other than "yes".

contain_covid19(barrier_gest = "yes", wear_mask = "yes", get_vaccine = "no")
## [1] "please make sure all are yes, this pandemic has to end!"

If we provide a value that the function does not specifically anticipate, such as "sometimes", no error is raised. The condition simply evaluates to FALSE and the function returns the same message:

contain_covid19(barrier_gest = "sometimes", wear_mask = "yes", get_vaccine = "no")
## [1] "please make sure all are yes, this pandemic has to end!"

NOTE: Some functions (usually very short and straightforward ones) do not need a name and can be used directly on a line of code or inside another function to do a quick task. They are called anonymous functions.

For instance, below is a first anonymous function that keeps only the character variables of the dataset.

linelist %>% 
  dplyr::slice_head(n = 10) %>%                  # like head() in base R: returns the first n observations of the dataset
  select(where(function(x) is.character(x)))    # where() wraps the anonymous predicate function

Next is another anonymous function, which selects every second observation of our dataset (this may be relevant with longitudinal data that has many records per patient, for instance after ordering by date or visit). Here, the function as it would be written outside dplyr is function(x) (x %% 2 == 0), applied to the vector containing all row numbers.

linelist %>%   
   slice_head(n=20) %>% 
   tibble::rownames_to_column() %>% # add indices of each obs as rownames to clearly see the final selection
   filter(row_number() %%2 == 0)

A possible base R code for the same task would be:

linelist_firstobs <- head(linelist, 20)

linelist_firstobs[base::Filter(function(x) (x%%2 == 0), seq(nrow(linelist_firstobs))),]

CAUTION: Although functions can help us with our code, it can be time-consuming to write them, or to fix one that was not thought through thoroughly or written adequately and returns errors as a result. For this reason, it is often recommended to first write the R code, make sure it does what we intend, and then transform it into a function with its three main components as listed above.

44.6 Examples

Return proportion tables for several columns

Yes, many packages already offer nice functions for summarizing information easily. But we will still try to make our own, as a first step in getting used to writing functions.

In this example we want to show how writing a simple function saves you from copy-pasting the same code multiple times.

proptab_multiple <- function(my_data, var_to_tab){
  
  # print the name of the variable of interest before doing the tabulation
  print(var_to_tab)

  # bind the results of the two following calls by row
  rbind(
    # tabulate the variable of interest: gives only counts (missing values excluded)
    table(my_data[[var_to_tab]], useNA = "no"),
    # calculate the proportions for the variable of interest and round to 2 decimals
    round(prop.table(table(my_data[[var_to_tab]])) * 100, 2)
  )
}


proptab_multiple(linelist, "gender")
## [1] "gender"
##            f       m
## [1,] 2807.00 2803.00
## [2,]   50.04   49.96
proptab_multiple(linelist, "age_cat")
## [1] "age_cat"
##          0-4     5-9  10-14  15-19   20-29 30-49 50-69 70+
## [1,] 1095.00 1095.00 941.00 743.00 1073.00   754 95.00 6.0
## [2,]   18.87   18.87  16.22  12.81   18.49    13  1.64 0.1
proptab_multiple(linelist, "outcome")
## [1] "outcome"
##        Death Recover
## [1,] 2582.00 1983.00
## [2,]   56.56   43.44

TIP: As shown above, it is very important to comment your functions, as you would for programming in general. Bear in mind that the aim of a function is to make code easier to read, shorter, and more efficient. One should be able to understand what a function does just by reading its name, and find more detail in the comments.

A second option is to call this function inside a loop, to process all the variables at once:

for(var_to_tab in c("gender","age_cat",  "outcome")){
  
  print(proptab_multiple(linelist, var_to_tab))
  
}
## [1] "gender"
##            f       m
## [1,] 2807.00 2803.00
## [2,]   50.04   49.96
## [1] "age_cat"
##          0-4     5-9  10-14  15-19   20-29 30-49 50-69 70+
## [1,] 1095.00 1095.00 941.00 743.00 1073.00   754 95.00 6.0
## [2,]   18.87   18.87  16.22  12.81   18.49    13  1.64 0.1
## [1] "outcome"
##        Death Recover
## [1,] 2582.00 1983.00
## [2,]   56.56   43.44

A more compact alternative is to replace the for loop with a function from the base R “apply” family, such as lapply().
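A minimal sketch of the apply-based approach follows. So that it runs on its own, it uses a small made-up data frame (toy_linelist is hypothetical and stands in for linelist) and a simplified copy of the tabulation function, here called proptab_simple().

```r
# Made-up data standing in for the linelist (assumption, for illustration only)
toy_linelist <- data.frame(
  gender  = c("f", "m", "f", "f"),
  outcome = c("Death", "Recover", "Recover", NA),
  stringsAsFactors = FALSE
)

# Simplified version of the function above: counts on row 1, percentages on row 2
proptab_simple <- function(my_data, var_to_tab) {
  counts <- table(my_data[[var_to_tab]], useNA = "no")
  rbind(counts, percent = round(prop.table(counts) * 100, 2))
}

vars <- c("gender", "outcome")

# lapply() calls the function once per column name and collects the results in a list
tabs <- lapply(vars, function(v) proptab_simple(toy_linelist, v))
names(tabs) <- vars

tabs$gender
```

Unlike the for loop, lapply() returns all the tables in one named list, so they can be stored and reused rather than only printed.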

TIP: R is often described as a functional programming language, and almost any time you run a line of code you are using built-in functions. A good habit for becoming more comfortable with writing functions is to look at how the functions you use daily are built. A shortcut for this is to select the function name and then press Ctrl+F2, fn+F2, or Cmd+F2 (depending on your computer).

44.7 Using purrr: writing functions that can be iteratively applied

Modify class of multiple columns in a dataset

Let’s say many character variables in the original linelist data need to be changed to “factor” for analysis and plotting purposes. Instead of repeating the step several times, we can use lapply() to transform all the variables concerned in a single line of code.

CAUTION: lapply() returns a list, thus its use may require an additional modification as a last step.
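As a minimal sketch of the lapply() approach, here is the conversion applied to a small made-up data frame (toy_linelist is hypothetical and stands in for linelist). Assigning the list back into the selected columns is one way to handle the extra step mentioned in the caution.

```r
# Made-up data standing in for the linelist (assumption, for illustration only)
toy_linelist <- data.frame(
  case_id = c("a1", "b2", "c3"),
  gender  = c("f", "m", "f"),
  age     = c(2, 34, 61),
  stringsAsFactors = FALSE
)

# Identify the character columns
char_cols <- names(toy_linelist)[sapply(toy_linelist, is.character)]

# lapply() returns a list; assigning it back into the selected columns
# keeps the object a data frame
toy_factor <- toy_linelist
toy_factor[char_cols] <- lapply(toy_factor[char_cols], as.factor)

str(toy_factor)
```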

The same step can be done using the map_if() function from the purrr package. As the output of glimpse() shows, the result is also a list rather than a data frame.

linelist_factor2 <- linelist %>%
  purrr::map_if(is.character, as.factor)


linelist_factor2 %>%
        glimpse()
## List of 30
##  $ case_id             : Factor w/ 5888 levels "00031d","00086d",..: 2134 3022 396 4203 3084 4347 179 1241 5594 430 ...
##  $ generation          : num [1:5888] 4 4 2 3 3 3 4 4 4 4 ...
##  $ date_infection      : Date[1:5888], format: "2014-05-08" NA NA "2014-05-04" ...
##  $ date_onset          : Date[1:5888], format: "2014-05-13" "2014-05-13" "2014-05-16" "2014-05-18" ...
##  $ date_hospitalisation: Date[1:5888], format: "2014-05-15" "2014-05-14" "2014-05-18" "2014-05-20" ...
##  $ date_outcome        : Date[1:5888], format: NA "2014-05-18" "2014-05-30" NA ...
##  $ outcome             : Factor w/ 2 levels "Death","Recover": NA 2 2 NA 2 2 2 1 2 1 ...
##  $ gender              : Factor w/ 2 levels "f","m": 2 1 2 1 2 1 1 1 2 1 ...
##  $ age                 : num [1:5888] 2 3 56 18 3 16 16 0 61 27 ...
##  $ age_unit            : Factor w/ 2 levels "months","years": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age_years           : num [1:5888] 2 3 56 18 3 16 16 0 61 27 ...
##  $ age_cat             : Factor w/ 8 levels "0-4","5-9","10-14",..: 1 1 7 4 1 4 4 1 7 5 ...
##  $ age_cat5            : Factor w/ 18 levels "0-4","5-9","10-14",..: 1 1 12 4 1 4 4 1 13 6 ...
##  $ hospital            : Factor w/ 6 levels "Central Hospital",..: 4 3 6 5 2 5 3 3 3 3 ...
##  $ lon                 : num [1:5888] -13.2 -13.2 -13.2 -13.2 -13.2 ...
##  $ lat                 : num [1:5888] 8.47 8.45 8.46 8.48 8.46 ...
##  $ infector            : Factor w/ 2697 levels "00031d","002e6c",..: 2594 NA NA 2635 180 1799 1407 195 NA NA ...
##  $ source              : Factor w/ 2 levels "funeral","other": 2 NA NA 2 2 2 2 2 NA NA ...
##  $ wt_kg               : num [1:5888] 27 25 91 41 36 56 47 0 86 69 ...
##  $ ht_cm               : num [1:5888] 48 59 238 135 71 116 87 11 226 174 ...
##  $ ct_blood            : num [1:5888] 22 22 21 23 23 21 21 22 22 22 ...
##  $ fever               : Factor w/ 2 levels "no","yes": 1 NA NA 1 1 1 NA 1 1 1 ...
##  $ chills              : Factor w/ 2 levels "no","yes": 1 NA NA 1 1 1 NA 1 1 1 ...
##  $ cough               : Factor w/ 2 levels "no","yes": 2 NA NA 1 2 2 NA 2 2 2 ...
##  $ aches               : Factor w/ 2 levels "no","yes": 1 NA NA 1 1 1 NA 1 1 1 ...
##  $ vomit               : Factor w/ 2 levels "no","yes": 2 NA NA 1 2 2 NA 2 2 1 ...
##  $ temp                : num [1:5888] 36.8 36.9 36.9 36.8 36.9 37.6 37.3 37 36.4 35.9 ...
##  $ time_admission      : Factor w/ 1072 levels "00:10","00:29",..: NA 308 746 415 514 589 609 297 409 387 ...
##  $ bmi                 : num [1:5888] 117.2 71.8 16.1 22.5 71.4 ...
##  $ days_onset_hosp     : num [1:5888] 2 1 2 2 1 1 2 1 1 2 ...

Iteratively produce graphs for different levels of a variable

Here we will produce pie charts to look at the distribution of patient outcomes in China during the H7N9 outbreak, for each province. Instead of repeating the code for each province, we will apply a function that we create.

# set options for the use of highcharter
options(highcharter.theme =   highcharter::hc_theme_smpl(tooltip = list(valueDecimals = 2)))


#create a function called "chart_outcome_province" that takes as argument the dataset and the name of the province for which to plot the distribution of the outcome.

chart_outcome_province <- function(data_used, prov){
  
  tab_prov <- data_used %>% 
    filter(province == prov,
           !is.na(outcome))%>% 
    group_by(outcome) %>% 
    count() %>%
    adorn_totals(where = "row") %>% 
    adorn_percentages(denominator = "col") %>%
    mutate(
        perc_outcome= round(n*100,2))
  
  
  tab_prov %>%
    filter(outcome != "Total") %>% 
  highcharter::hchart(
    "pie", hcaes(x = outcome, y = perc_outcome),
    name = paste0("Distribution of the outcome in:", prov)
    )
  
}

chart_outcome_province(flu_china, "Shanghai")
chart_outcome_province(flu_china,"Zhejiang")
chart_outcome_province(flu_china,"Jiangsu")

Iteratively produce tables for different levels of a variable

Here we will create three indicators to summarize in a table and we would like to produce this table for each of the provinces. Our indicators are the delay between onset and hospitalization, the percentage of recovery and the median age of cases.

indic_1 <- flu_china %>% 
  group_by(province) %>% 
  mutate(
    # as.Date() keeps these as simple dates, so their difference is always expressed in days
    date_hosp = as.Date(date_of_hospitalisation, format = "%m/%d/%Y"),
    date_ons = as.Date(date_of_onset, format = "%m/%d/%Y"), 
    delay_onset_hosp = as.numeric(date_hosp - date_ons),
    mean_delay_onset_hosp = round(mean(delay_onset_hosp, na.rm = TRUE), 0)) %>%
  select(province, mean_delay_onset_hosp) %>% 
  distinct()
     

indic_2 <-  flu_china %>% 
            filter(!is.na(outcome)) %>% 
            group_by(province, outcome) %>% 
            count() %>%
            pivot_wider(names_from = outcome, values_from = n) %>% 
    adorn_totals(where = "col") %>% 
    mutate(
        perc_recovery= round((Recover/Total)*100,2))%>% 
  select(province, perc_recovery)
    
    
    
indic_3 <-  flu_china %>% 
            group_by(province) %>% 
            mutate(
                    median_age_cases = median(as.numeric(age), na.rm = TRUE)
            ) %>% 
  select(province, median_age_cases)  %>% 
  distinct()
## Warning in median(as.numeric(age), na.rm = TRUE): NAs introduced by coercion
#join the three indicator datasets

table_indic_all <- indic_1 %>% 
  dplyr::left_join(indic_2, by = "province") %>% 
        left_join(indic_3, by = "province")


#print the indicators in a flextable


print_indic_prov <- function(table_used, prov){
  
  # first reshape the data frame to make printing easier
  indic_prov <- table_used %>%
    filter(province == prov) %>%
    pivot_longer(names_to = "Indicateurs", cols = 2:4) %>% 
    mutate(indic_label = factor(Indicateurs,
      levels = c("mean_delay_onset_hosp", "perc_recovery", "median_age_cases"),
      labels = c("Mean delay onset-hosp", "Percentage of recovery", "Median age of the cases"))
    ) %>% 
    ungroup(province) %>% 
    select(indic_label, value)
  
  # build the flextable and style its body
  tab_print <- flextable(indic_prov) %>%
    theme_vanilla() %>% 
    flextable::fontsize(part = "body", size = 10) 
  
  # set the header labels and style the header
  tab_print <- tab_print %>% 
    autofit() %>%
    set_header_labels(indic_label = "Indicators", value = "Estimate") %>%
    flextable::bg(bg = "darkblue", part = "header") %>%
    flextable::bold(part = "header") %>%
    flextable::color(color = "white", part = "header") %>% 
    add_header_lines(values = paste0("Indicators for the province of: ", prov)) %>% 
    bold(part = "header")
  
  # show numbers with 2 decimals and missing values as "-"
  tab_print <- set_formatter_type(tab_print,
    fmt_double = "%.2f",
    na_str = "-")

  tab_print 
}




print_indic_prov(table_indic_all, "Shanghai")
print_indic_prov(table_indic_all, "Jiangsu")
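Rather than calling the function once per province by hand, the call itself can be iterated. The sketch below uses base R lapply() on a made-up data frame; toy_flu and summarise_province() are hypothetical stand-ins written for this example. With purrr loaded, purrr::map() could be used in the same way.

```r
# Made-up data standing in for flu_china (assumption, for illustration only)
toy_flu <- data.frame(
  province = c("Shanghai", "Shanghai", "Jiangsu", "Jiangsu", "Jiangsu"),
  outcome  = c("Death", "Recover", "Recover", "Recover", NA),
  stringsAsFactors = FALSE
)

# Hypothetical function: percentage of recovery among known outcomes in one province
summarise_province <- function(data_used, prov) {
  known <- data_used[data_used$province == prov & !is.na(data_used$outcome), ]
  round(mean(known$outcome == "Recover") * 100, 2)
}

provinces <- unique(toy_flu$province)

# Apply the function to every province and name the results
perc_recovery <- lapply(provinces, function(p) summarise_province(toy_flu, p))
names(perc_recovery) <- provinces

perc_recovery
```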

44.8 Tips and best Practices for well functioning functions

Functional programming is meant to simplify code and make it easier to read; it should not produce the contrary. The tips below will help you keep your code clean and easy to read.

Naming and syntax

  • Avoid using names that could easily already be taken by other functions existing in your environment

  • It is recommended that the function name be short and straightforward for another reader to understand

  • It is preferable to use verbs for function names and nouns for argument names.

Column names and tidy evaluation

If you want to know how to reference column names that are provided to your code as arguments, read the tidyverse programming guidance. Among the topics covered are tidy evaluation and use of the embrace operator {{ }} (“double braces”).

For example, here is a quick skeleton template taken from the tutorial mentioned just above:

var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>% 
  group_by(cyl) %>% 
  var_summary(mpg)

Testing and Error handling

The more complicated a function’s task, the higher the possibility of errors. It is therefore sometimes necessary to add some verification within the function, to help quickly understand where an error comes from and find a way to fix it.

  • It is highly recommended to check whether an argument is missing, using missing(argument). This simple check returns TRUE or FALSE.
contain_covid19_missing <- function(barrier_gest, wear_mask, get_vaccine){
  
  if (missing(barrier_gest)) (print("please provide arg1"))
  if (missing(wear_mask)) print("please provide arg2")
  if (missing(get_vaccine)) print("please provide arg3")


  if (!barrier_gest == "yes" | wear_mask =="yes" | get_vaccine == "yes" ) 
       
       return ("you can do better")
  
  else("please make sure all are yes, this pandemic has to end!")
}


contain_covid19_missing(get_vaccine = "yes")
## [1] "please provide arg1"
## [1] "please provide arg2"
## Error in contain_covid19_missing(get_vaccine = "yes"): argument "barrier_gest" is missing, with no default
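  • Use stop() for more detectable errors. In the example above, print() only displays a message and does not halt the function, which is why the call still ends in R’s default error; stop() halts execution immediately with a message of your choosing.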
contain_covid19_stop <- function(barrier_gest, wear_mask, get_vaccine){
  
  if(!is.character(barrier_gest)) (stop("arg1 should be a character, please enter the value with `yes`, `no` or `sometimes`"))
  
  if (barrier_gest == "yes" & wear_mask =="yes" & get_vaccine == "yes" ) 
       
       return ("success")
  
  else("please make sure all are yes, this pandemic has to end!")
}


contain_covid19_stop(barrier_gest=1, wear_mask="yes", get_vaccine = "no")
## Error in contain_covid19_stop(barrier_gest = 1, wear_mask = "yes", get_vaccine = "no"): arg1 should be a character, please enter the value with `yes`, `no` or `sometimes`
  • As we see when we run most of the built-in functions, there are messages and warnings that can pop-up in certain conditions. We can integrate those in our written functions by using the functions message() and warning().
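A minimal sketch of how message(), warning(), and stop() differ, using a hypothetical helper, mean_age(), that is not part of the handbook: stop() halts the function, while warning() and message() let it continue.

```r
# Hypothetical helper (not from the handbook): mean age with informative conditions
mean_age <- function(ages) {
  if (!is.numeric(ages)) {
    stop("`ages` must be numeric")                         # error: execution stops here
  }
  n_missing <- sum(is.na(ages))
  if (n_missing > 0) {
    warning(n_missing, " missing value(s) were ignored")   # warning: execution continues
  }
  message("Computing the mean of ", length(ages) - n_missing, " value(s)")  # informational note
  mean(ages, na.rm = TRUE)
}

# Emits a message and a warning, and still returns a value
result <- mean_age(c(10, 20, NA))
result
```

Calling mean_age("ten") would instead stop with the error message, and no value would be returned.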

  • We can also handle errors using safely() from purrr, which takes a function as an argument and returns a version of it that executes safely: it will not stop when it encounters an error. The wrapped function returns a list with two elements, the result and the error it “skipped”.

We can verify this by first running mean() directly, then running it wrapped with safely().

map(linelist, mean)
## $case_id
## [1] NA
## 
## $generation
## [1] 16.56165
## 
## $date_infection
## [1] NA
## 
## $date_onset
## [1] NA
## 
## $date_hospitalisation
## [1] "2014-11-03"
## 
## $date_outcome
## [1] NA
## 
## $outcome
## [1] NA
## 
## $gender
## [1] NA
## 
## $age
## [1] NA
## 
## $age_unit
## [1] NA
## 
## $age_years
## [1] NA
## 
## $age_cat
## [1] NA
## 
## $age_cat5
## [1] NA
## 
## $hospital
## [1] NA
## 
## $lon
## [1] -13.23381
## 
## $lat
## [1] 8.469638
## 
## $infector
## [1] NA
## 
## $source
## [1] NA
## 
## $wt_kg
## [1] 52.64487
## 
## $ht_cm
## [1] 124.9633
## 
## $ct_blood
## [1] 21.20686
## 
## $fever
## [1] NA
## 
## $chills
## [1] NA
## 
## $cough
## [1] NA
## 
## $aches
## [1] NA
## 
## $vomit
## [1] NA
## 
## $temp
## [1] NA
## 
## $time_admission
## [1] NA
## 
## $bmi
## [1] 46.89023
## 
## $days_onset_hosp
## [1] NA
safe_mean <- safely(mean)
linelist %>% 
  map(safe_mean)
## $case_id
## $case_id$result
## [1] NA
## 
## $case_id$error
## NULL
## 
## 
## $generation
## $generation$result
## [1] 16.56165
## 
## $generation$error
## NULL
## 
## 
## $date_infection
## $date_infection$result
## [1] NA
## 
## $date_infection$error
## NULL
## 
## 
## $date_onset
## $date_onset$result
## [1] NA
## 
## $date_onset$error
## NULL
## 
## 
## $date_hospitalisation
## $date_hospitalisation$result
## [1] "2014-11-03"
## 
## $date_hospitalisation$error
## NULL
## 
## 
## $date_outcome
## $date_outcome$result
## [1] NA
## 
## $date_outcome$error
## NULL
## 
## 
## $outcome
## $outcome$result
## [1] NA
## 
## $outcome$error
## NULL
## 
## 
## $gender
## $gender$result
## [1] NA
## 
## $gender$error
## NULL
## 
## 
## $age
## $age$result
## [1] NA
## 
## $age$error
## NULL
## 
## 
## $age_unit
## $age_unit$result
## [1] NA
## 
## $age_unit$error
## NULL
## 
## 
## $age_years
## $age_years$result
## [1] NA
## 
## $age_years$error
## NULL
## 
## 
## $age_cat
## $age_cat$result
## [1] NA
## 
## $age_cat$error
## NULL
## 
## 
## $age_cat5
## $age_cat5$result
## [1] NA
## 
## $age_cat5$error
## NULL
## 
## 
## $hospital
## $hospital$result
## [1] NA
## 
## $hospital$error
## NULL
## 
## 
## $lon
## $lon$result
## [1] -13.23381
## 
## $lon$error
## NULL
## 
## 
## $lat
## $lat$result
## [1] 8.469638
## 
## $lat$error
## NULL
## 
## 
## $infector
## $infector$result
## [1] NA
## 
## $infector$error
## NULL
## 
## 
## $source
## $source$result
## [1] NA
## 
## $source$error
## NULL
## 
## 
## $wt_kg
## $wt_kg$result
## [1] 52.64487
## 
## $wt_kg$error
## NULL
## 
## 
## $ht_cm
## $ht_cm$result
## [1] 124.9633
## 
## $ht_cm$error
## NULL
## 
## 
## $ct_blood
## $ct_blood$result
## [1] 21.20686
## 
## $ct_blood$error
## NULL
## 
## 
## $fever
## $fever$result
## [1] NA
## 
## $fever$error
## NULL
## 
## 
## $chills
## $chills$result
## [1] NA
## 
## $chills$error
## NULL
## 
## 
## $cough
## $cough$result
## [1] NA
## 
## $cough$error
## NULL
## 
## 
## $aches
## $aches$result
## [1] NA
## 
## $aches$error
## NULL
## 
## 
## $vomit
## $vomit$result
## [1] NA
## 
## $vomit$error
## NULL
## 
## 
## $temp
## $temp$result
## [1] NA
## 
## $temp$error
## NULL
## 
## 
## $time_admission
## $time_admission$result
## [1] NA
## 
## $time_admission$error
## NULL
## 
## 
## $bmi
## $bmi$result
## [1] 46.89023
## 
## $bmi$error
## NULL
## 
## 
## $days_onset_hosp
## $days_onset_hosp$result
## [1] NA
## 
## $days_onset_hosp$error
## NULL
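In the output above every error element is NULL, because mean() did not raise an error for any column (for non-numeric input it warns and returns NA). To see an error actually being captured, here is a base R sketch of the same idea built on tryCatch(). Note that safely_sketch() is a hypothetical stand-in written for this example, not purrr's implementation.

```r
# Hypothetical stand-in for the idea behind safely(), written with base R tryCatch()
safely_sketch <- function(f) {
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),                 # normal case: keep the result
      error = function(e) list(result = NULL, error = e)   # error case: keep the error instead
    )
  }
}

safe_log <- safely_sketch(log)

inputs  <- list(valid = 10, invalid = "ten")
outputs <- lapply(inputs, safe_log)   # does not stop at the invalid input

outputs$valid$result                     # log(10)
conditionMessage(outputs$invalid$error)  # the captured error message
```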

As said previously, commenting our code well is already a good way to document our work.

45 Directory interactions

In this page we cover common scenarios where you create, interact with, save to, and import from directories (folders).

45.1 Preparation

fs package

The fs package is a tidyverse package that facilitates directory interactions, improving on some of the base R functions. In the sections below we will often use functions from fs.

pacman::p_load(
  fs,             # file/directory interactions
  rio,            # import/export
  here,           # relative file pathways
  tidyverse)      # data management and visualization

45.2 List files in a directory

To list just the file names in a directory you can use dir() from base R. For example, this command lists the file names of the files in the “population” subfolder of the “data” folder in an R project. The relative filepath is provided using here() (which you can read about more in the Import and export page).

# file names
dir(here("data", "gis", "population"))
## [1] "sle_admpop_adm3_2020.csv"                        "sle_population_statistics_sierraleone_2020.xlsx"

To list the full file paths of the directory’s files, you can use dir_ls() from fs. A base R alternative is list.files() with full.names = TRUE.

# file paths
dir_ls(here("data", "gis", "population"))
## C:/Users/neale/OneDrive - Neale Batra/Documents/Analytics-LAPTOP-RS5P2IBO/R/Projects/R handbook/epiRhandbook_eng/data/gis/population/sle_admpop_adm3_2020.csv
## C:/Users/neale/OneDrive - Neale Batra/Documents/Analytics-LAPTOP-RS5P2IBO/R/Projects/R handbook/epiRhandbook_eng/data/gis/population/sle_population_statistics_sierraleone_2020.xlsx

To get all the metadata information about each file in a directory, (e.g. path, modification date, etc.) you can use dir_info() from fs.

This can be particularly useful if you want to extract the last modification time of the file, for example if you want to import the most recent version of a file. For an example of this, see the Import and export page.

# file info
dir_info(here("data", "gis", "population"))

Here is the data frame returned. Scroll to the right to see all the columns.

45.3 File information

To extract metadata information about a specific file, you can use file_info() from fs (or file.info() from base R).

file_info(here("data", "case_linelists", "linelist_cleaned.rds"))

Here we use the $ to index the result and return only the modification_time value.

file_info(here("data", "case_linelists", "linelist_cleaned.rds"))$modification_time
## [1] "2021-08-31 19:02:39 EDT"
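Building on the modification time, here is a sketch of picking the most recently modified file in a folder. It uses base R file.info() inside a temporary directory, and the file names are made up for the example.

```r
# Work in a throwaway folder so nothing real is touched
demo_dir <- file.path(tempdir(), "mtime_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Two made-up files
old_file <- file.path(demo_dir, "linelist_old.csv")
new_file <- file.path(demo_dir, "linelist_new.csv")
writeLines("case_id", old_file)
writeLines("case_id", new_file)

# Set the modification times explicitly so the example is deterministic
Sys.setFileTime(old_file, as.POSIXct("2021-01-01 12:00:00", tz = "UTC"))
Sys.setFileTime(new_file, as.POSIXct("2021-06-01 12:00:00", tz = "UTC"))

# file.info() returns one row per file; mtime holds the modification time
info <- file.info(list.files(demo_dir, full.names = TRUE))
most_recent <- rownames(info)[which.max(as.numeric(info$mtime))]

basename(most_recent)
```

The resulting path could then be passed to an import function to read the latest version of the file.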

45.4 Check if exists

R objects

You can use exists() from base R to check whether an R object exists within R (supply the object name in quotes).

exists("linelist")
## [1] TRUE

Note that some base R packages use generic object names like “data” behind the scenes, so exists() will return TRUE for them unless inherits = FALSE is specified. This is one reason not to name your dataset “data”.

exists("data")
## [1] TRUE
exists("data", inherits = FALSE)
## [1] FALSE

If you are writing a function, you should use missing() from base R to check if an argument is present or not, instead of exists().

Directories

To check whether a directory exists, provide the directory path to is_dir() from fs. Scroll to the right to see that TRUE is printed.

is_dir(here("data"))
## C:/Users/neale/OneDrive - Neale Batra/Documents/Analytics-LAPTOP-RS5P2IBO/R/Projects/R handbook/epiRhandbook_eng/data 
##                                                                                                                  TRUE

An alternative is dir.exists() from base R (file.exists() also returns TRUE for an existing directory).

Files

To check if a specific file exists, use is_file() from fs. Scroll to the right to see that TRUE is printed.

is_file(here("data", "case_linelists", "linelist_cleaned.rds"))
## C:/Users/neale/OneDrive - Neale Batra/Documents/Analytics-LAPTOP-RS5P2IBO/R/Projects/R handbook/epiRhandbook_eng/data/case_linelists/linelist_cleaned.rds 
##                                                                                                                                                      TRUE

A base R alternative is file.exists().

45.5 Create

Directories

To create a new directory (folder) you can use dir_create() from fs. If the directory already exists, it will not be overwritten and no error will be returned.

dir_create(here("data", "test"))

An alternative is dir.create() from base R, which will show a warning (and return FALSE) if the directory already exists. In contrast, dir_create() in this scenario will be silent.

Files

You can create an (empty) file with file_create() from fs. If the file already exists, it will not be over-written or changed.

file_create(here("data", "test.rds"))

A base R alternative is file.create(). But if the file already exists, this option will truncate it. If you use file_create() the file will be left unchanged.

Create if does not exist

UNDER CONSTRUCTION
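One possible pattern, sketched here with base R only and a temporary folder (the folder and file names are made up), is to test for existence first and create only when needed:

```r
# Throwaway location for the demonstration
out_dir  <- file.path(tempdir(), "outputs_demo")
out_file <- file.path(out_dir, "notes.txt")

# Create the folder only if it is not already there
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Create the file only if it is not already there, so existing content is never truncated
if (!file.exists(out_file)) {
  file.create(out_file)
}

dir.exists(out_dir)    # TRUE
file.exists(out_file)  # TRUE
```

With fs, dir_create() and file_create() already leave existing directories and files untouched, as described above, so the explicit check matters mainly for the base R functions.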

45.6 Delete

R objects

Use rm() from base R to remove an R object.

Directories

Use dir_delete() from fs.

Files

You can delete files with file_delete() from fs.
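A minimal sketch of deletion follows, using base R equivalents (rm(), file.remove(), and unlink()) inside a temporary folder so that nothing real is removed; the object, folder, and file names are made up. dir_delete() and file_delete() from fs take a path in the same way.

```r
# Remove an R object
temp_object <- 1:10
rm(temp_object)
exists("temp_object", inherits = FALSE)   # FALSE

# Set up a throwaway folder containing one file
demo_dir  <- file.path(tempdir(), "delete_demo")
demo_file <- file.path(demo_dir, "old_results.csv")
dir.create(demo_dir, showWarnings = FALSE)
file.create(demo_file)

# Delete the file, then the (now empty) folder
file.remove(demo_file)
unlink(demo_dir, recursive = TRUE)

file.exists(demo_file)  # FALSE
dir.exists(demo_dir)    # FALSE
```

Deletions made this way do not go through a recycle bin, so it is worth double-checking the path before running them.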

45.7 Running other files

source()

To run one R script from another R script, you can use the source() command (from base R).

source(here("scripts", "cleaning_scripts", "clean_testing_data.R"))

This is equivalent to viewing the above R script and clicking the “Source” button in the upper-right of the script. This will execute the script but will do it silently (no output to the R console) unless specifically intended. See the page on [Interactive console] for examples of using source() to interact with a user via the R console in question-and-answer mode.

render()

render() from the rmarkdown package is a variation on source(), most often used for R Markdown scripts. You provide input =, which is the R Markdown file, and optionally output_format = (typically “html_document”, “pdf_document”, or “word_document”).

See the page on Reports with R Markdown for more details. Also see the documentation for render() here or by entering ?render.

Run files in a directory

You can create a for loop and use it to source() every file in a directory, as identified with dir().

for(script in dir(here("scripts"), pattern = "\\.R$")) {  # for each script name in the R Project's "scripts" folder (with .R extension)
  source(here("scripts", script))                        # source the file with the matching name that exists in the scripts folder
}

If you only want to run certain scripts, you can identify them by name like this:

scripts_to_run <- c(
     "epicurves.R",
     "demographic_tables.R",
     "survival_curves.R"
)

for(script in scripts_to_run) {
  source(here("scripts", script))
}

Here is a comparison of the fs and base R functions.

Import files in a directory

See the page on Import and export for importing and exporting individual files.

Also see the Import and export page for methods to automatically import the most recent file, based on a date in the file name or by looking at the file meta-data.

See the page on Iteration, loops, and lists for an example with the package purrr demonstrating:

  • Splitting a data frame and saving it out as multiple CSV files
  • Splitting a data frame and saving each part as a separate sheet within one Excel workbook
  • Importing multiple CSV files and combining them into one dataframe
  • Importing an Excel workbook with multiple sheets and combining them into one dataframe

45.8 base R

See below the functions list.files() and dir(), which perform the same operation of listing files within a specified directory. You can specify ignore.case = or a specific pattern to look for.

list.files(path = here("data"))

list.files(path = here("data"), pattern = ".csv")
# dir(path = here("data"), pattern = ".csv")

list.files(path = here("data"), pattern = "evd", ignore.case = TRUE)

If a file is currently “open”, it will display in your folder with a tilde in front, like “~$hospital_linelists.xlsx”.

46 Version control and collaboration with Git and Github

This chapter presents an overview of using Git to collaborate with others. More extensive tutorials can be found at the bottom in the Resources section.

46.1 What is Git?

Git is version control software that tracks changes in a folder. It can be used like the “track changes” option in Word, LibreOffice or Google Docs, but for all types of files. It is one of the most powerful and most widely used options for version control.

Why have I never heard of it? - While people with a developer background routinely learn to use version control software (Git, Mercurial, Subversion or others), few of us from quantitative disciplines are taught these skills. Consequently, most epidemiologists never hear of it during their studies, and have to learn it on the fly.

Wait, I heard of Github, is it the same? - Not exactly, but you often use them together, and we will show you how to. In short:

  • Git is the version control system, a piece of software. You can use it locally on your computer or to synchronize a folder with a host website. By default, one uses a terminal to give Git instructions on the command line.

  • You can use a Git client/interface to avoid the command-line and perform the same actions (at least for the simple, super common ones).

  • If you want to store your folder in a host website to collaborate with others, you may create an account at Github, Gitlab, Bitbucket or others.

So you could use the client/interface Github Desktop, which uses Git in the background to manage your files, both locally on your computer, and remotely on a Github server.

46.2 Why use the combo Git and Github?

Using Git facilitates:

  1. Archiving documented versions with incremental changes so that you can easily revert backwards to any previous state
  2. Having parallel branches, i.e. developing/“working” versions with structured ways to integrate the changes after review

This can be done locally on your computer, even if you don’t collaborate with other people. Have you ever:

  • regretted having deleted a section of code, only to realize two months later that you actually needed it?

  • come back to a project that had been on pause and attempted to remember whether you had made that tricky modification in one of the models?

  • had a file model_1.R and another file model_1_test.R and a file model_1_not_working.R to try things out?

  • had a file report.Rmd, a file report_full.Rmd, a file report_true_final.Rmd, a file report_final_20210304.Rmd, a file report_final_20210402.Rmd and cursed your archiving skills?

Git will help with all that, and is worth learning for that alone.

However, it becomes even more powerful when used with an online repository such as Github to support collaborative projects. This facilitates:

  • Collaboration: others can review, comment on, and accept/decline changes

  • Sharing your code, data, and outputs, and inviting feedback from the public (or privately, with your team)

and avoids:

  • “Oops, I forgot to send the last version and now you need to redo two days’ worth of work on this new file”

  • Mina, Henry and Oumar all worked at the same time on one script and need to manually merge their changes

  • Two people try to modify the same file on Dropbox and Sharepoint and this creates a synchronization error.

This sounds complicated, I am not a programmer

It can be. Examples of advanced uses can be quite scary. However, much like R, or even Excel, you don’t need to become an expert to reap the benefits of the tool. Learning a small number of functions and notions lets you track your changes, synchronize your files with an online repository and collaborate with your colleagues in a very short amount of time.

Because of the learning curve, an emergency context may not be the best time to learn these tools. But learning can be done in steps. Once you acquire a couple of notions, your workflow can be quite efficient and fast. If you are not working on a project where collaborating with people through Git is a necessity, it is actually a good time to get confident using it solo before diving into collaboration.

46.3 Setup

Install Git

Git is the engine behind the scenes on your computer, which tracks changes, branches (versions), merges, and reverting. You must first install Git from https://git-scm.com/downloads.

Github account

Sign-up for a free account at github.com.

You may be prompted to set up two-factor authentication with an app on your phone. Read more in the Github help documents.

If you use Github Desktop, you can enter your Github credentials after installation following these steps. If you don’t do it now, you will be asked for your credentials later, when you try to clone a project from Github.

46.4 Vocabulary, concepts and basic functions

As when learning R, there is a bit of vocabulary to remember to understand Git. Here are the basics to get you going / interactive tutorial. In the next sections, we will show how to use interfaces, but it is good to have the vocabulary and concepts in mind, to build your mental model, and as you’ll need them when using interfaces anyway.

Repository

A Git repository (“repo”) is a folder that contains all the sub-folders and files for your project (data, code, images, etc.) and their revision histories. When you begin tracking changes in the repository with it, Git will create a hidden folder that contains all tracking information. A typical Git repository is your R Project folder (see handbook page on R projects).

We will show how to create (initialize) a Git repository from Github, Github Desktop or Rstudio in the next sections.
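As a minimal console sketch of what initializing a repository does, the commands below start tracking a folder and show the hidden folder appear. The temporary folder is a hypothetical stand-in for an R project folder, and the commands assume a Bash-style terminal (for example Git Bash on Windows).

```shell
# Throwaway folder standing in for an R project (hypothetical example)
demo_dir="$(mktemp -d)"
cd "$demo_dir"

git init -q    # begin tracking: Git creates the hidden ".git" folder here
ls -a          # the listing now includes ".git"
git status     # reports the current branch and that there are no commits yet
```

Deleting the ".git" folder would remove the tracking information (the whole history), so it is best left alone.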

Commits

A commit is a snapshot of the project at a given time. When you make a change to the project, you will make a new commit to track the changes (the delta) made to your files. For example, perhaps you edited some lines of code and updated a related dataset. Once your changes are saved, you can bundle these changes together into one “commit”.

Each commit has a unique ID (a hash). For version control purposes, you can revert your project back in time based on commits, so it is best to keep them relatively small and coherent. You will also attach a brief description of the changes called the “commit message”.

Staged changes? To stage changes is to add them to the staging area in preparation for the next commit. The idea is that you can finely decide which changes to include in a given commit. For example, if you worked on model specification in one script, and later on a figure in another script, it would make sense to have two different commits (it would be easier in case you wanted to revert the changes on the figure but not the model).
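The staging idea can be sketched in the console as follows. The file names and contents are hypothetical, and the repository is a throwaway one created only for the illustration (Bash-style terminal assumed).

```shell
# Throwaway repository for the illustration
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example User"        # identity stored for this demo repo only
git config user.email "user@example.org"

# Two unrelated pieces of work (hypothetical files)
echo "# model specification" > model.R
echo "# epidemic curve figure" > figure.R

git add model.R                  # stage only the model script
git status --short               # model.R is marked "A" (staged), figure.R "??" (untracked)
git commit -q -m "Specify model"

git add figure.R                 # now stage the figure script on its own
git commit -q -m "Add epidemic curve figure"

git log --oneline                # two separate commits, each with its own ID
```

Because the two changes sit in separate commits, the figure commit could later be reverted without touching the model commit.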

Branches

A branch represents an independent line of changes in your repo, a parallel, alternate version of your project files.

Branches are useful to test changes before they are incorporated into the main branch, which is usually the primary/final/“live” version of your project. When you are done experimenting on a branch, you can bring the changes into your main branch, by merging it, or delete it, if the changes were not so successful.

Note: you do not have to collaborate with other people to use branches, nor need to have a remote online repository.

Local and remote repositories

To clone is to create a copy of a Git repository in another place.

For example, you can clone an online repository from Github locally on your computer, or begin with a local repository and clone it online to Github.

When you have cloned a repository, the project files exist in two places:

  • the LOCAL repository on your physical computer. This is where you make the actual changes to the files/code.

  • the REMOTE, online repository: the versions of your project files in the Github repository (or on any other web host).

To synchronize these repositories, we will use more functions. Indeed, unlike Sharepoint, Dropbox or other synchronizing software, Git does not automatically update your local repository based on what’s online, or vice-versa. You get to choose when and how to synchronize.

  • git fetch downloads the new changes from the remote repository but does not change your local branches or files. Think of it as checking the state of the remote repository.

  • git pull downloads the new changes from the remote repository and updates your local repository.

  • When you have made one or several commits locally, you can git push the commits to the remote repository. This sends your changes to Github so that other people can see and pull them if they want to.
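The difference between these three commands can be tried out without a Github account. In the sketch below, a local "bare" repository stands in for the Github remote, and the folders mina and henry stand in for two collaborators' computers; all names are hypothetical and a Bash-style terminal is assumed.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# A bare repository plays the role of the Github remote (local stand-in)
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main   # default branch "main"

# Mina clones it (empty-repository warning silenced), commits, and pushes
git clone -q remote.git mina 2>/dev/null
cd mina
git symbolic-ref HEAD refs/heads/main     # name the first branch "main"
git config user.name "Mina"
git config user.email "mina@example.org"
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Start notes"
git push -q -u origin main
cd ..

# Henry clones, commits a change, and pushes it
git clone -q remote.git henry
cd henry
git config user.name "Henry"
git config user.email "henry@example.org"
echo "second line" >> notes.txt
git commit -q -a -m "Extend notes"
git push -q
cd ../mina

git fetch -q             # download Henry's commit; Mina's files are untouched
git status -sb           # first line reports that main is behind origin/main by 1
cat notes.txt            # still one line
git pull -q --ff-only    # now update the local branch and files
cat notes.txt            # both lines
```

The --ff-only option is a cautious choice made for this sketch: it lets the pull proceed only when the local branch can simply be moved forward.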

46.5 Get started: create a new repository

There are many ways to create new repositories. You can do it from the console, from Github, or from an interface.

Two general approaches to set-up are:

  • Create a new R Project from an existing or new Github repository (preferred for beginners), or
  • Create a Github repository for an existing R project

Start-up files

When you create a new repository, you can optionally create all of the below files, or you can add them to your repository at a later stage. They would typically live in the “root” folder of the repository.

  • A README file is a file that someone can read to understand why your project exists and what else they should know to use it. It will be empty at first, but you should complete it later.

  • A .gitignore file is a text file where each line would contain folders or files that Git should ignore (not track changes). Read more about it and see examples here.

  • You can choose a license for your work, so that other people know under which conditions they can use or reproduce your work. For more information, see the Creative Commons licenses.
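To make the .gitignore idea concrete, here is a small sketch in a throwaway repository. The ignored entries (a data folder, .Rhistory, log files) are illustrative assumptions, not a recommended list; a Bash-style terminal is assumed.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Hypothetical project content
mkdir data
echo "id,age" > data/linelist_raw.csv     # pretend this is sensitive data
echo "# analysis code" > analysis.R
touch .Rhistory

# One pattern per line: a folder, a single file, and a wildcard
printf '%s\n' "data/" ".Rhistory" "*.log" > .gitignore

git status --short                          # lists only .gitignore and analysis.R
git check-ignore -v data/linelist_raw.csv   # shows which .gitignore line matched
```

Note that .gitignore only affects files that are not yet tracked: a file that was committed before being listed there continues to be tracked.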

Create a new repository in Github

To create a new repository, log into Github and look for the green button to create a new repository. This now empty repository can be cloned locally to your computer (see next section).

You must choose if you want your repository to be public (visible to everyone on the internet) or private (only visible to those with permission). This has important implications if your data are sensitive. If your repository is private you will encounter some quotas in advanced special circumstances, such as if you are using Github actions to automatically run your code in the cloud.

Clone from a Github repository

You can clone an existing Github repository to create a new local R project on your computer.

The Github repository could be one that already exists and contains content, or could be an empty repository that you just created. In this latter case you are essentially creating the Github repo and local R project at the same time (see instructions above).

Note: if you do not have contributing rights on a Github repository, it is possible to first fork the repository to your profile, and then proceed with the other actions. Forking is explained at the end of this chapter, but we recommend that you read the other sections first.

Step 1: Navigate in Github to the repository, click on the green “Code” button and copy the HTTPS clone URL (see image below)

The next step can be performed in any interface. We will illustrate with Rstudio and Github desktop.

In Rstudio

In RStudio, start a new R project by clicking File > New Project > Version Control > Git

  • When prompted for the “Repository URL”, paste the HTTPS URL from Github
  • Assign the R project a short, informative name
  • Choose where the new R Project will be saved locally
  • Check “Open in new session” and click “Create project”

You are now in a new, local, RStudio project that is a clone of the Github repository. This local project and the Github repository are now linked.

In Github Desktop

  • Click on File > Clone a repository

  • Select the URL tab

  • Paste the HTTPS URL from Github in the first box

  • Select the folder in which you want to have your local repository

  • Click “CLONE”
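From a terminal, the same step is a single git clone command followed by the URL and, optionally, a folder name. The sketch below clones from a small local repository so that it can be run without network access; online_repo and my_project are hypothetical names, and with Github the HTTPS URL copied in Step 1 would take the place of the local path.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Build a tiny repository to clone from (local stand-in for a Github repo)
git init -q online_repo
cd online_repo
git config user.name "Example User"
git config user.email "user@example.org"
echo "# Outbreak analysis" > README.md
git add README.md
git commit -q -m "Add README"
cd ..

# With Github: git clone <HTTPS URL> my_project
git clone -q online_repo my_project
cd my_project
ls                   # README.md was copied
git log --oneline    # so was the commit history
git remote -v        # "origin" records where the clone came from
```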

New Github repo from existing R project

An alternative setup scenario is that you have an existing R project with content, and you want to create a Github repository for it.

  1. Create a new, empty Github repository for the project (see instructions above)
  2. Clone this repository locally (see HTTPS instructions above)
  3. Copy all the content from your pre-existing R project (codes, data, etc.) into this new empty, local, repository (e.g. use copy and paste).
  4. Open your new project in RStudio, and go to the Git pane. The new files should register as file changes, now tracked by Git. Therefore, you can bundle these changes as a commit and push them up to Github. Once pushed, the repository on Github will reflect all the files.

See the Github workflow section below for details on this process.

What does it look like now?

In RStudio

Once you have cloned a Github repository to a new R project, you now see in RStudio a “Git” tab. This tab appears in the same RStudio pane as your R Environment:

Please note the buttons circled in the image above, as they will be referenced later (from left to right):

  • Button to commit the saved file changes to the local branch (this will open a new window)
  • Blue arrow to pull (update your local version of the branch with any changes made on the remote/Github version of that branch)
  • Green arrow to push (send any commits/changes for your local version of the branch to the remote/Github version of that branch)
  • The Git tab in RStudio
  • Button to create a NEW branch using whichever local branch is shown to the right as the base. You almost always want to branch off from the main branch (after you first pull to update the main branch)
  • The branch you are currently working in
  • Changes you made to code or other files will appear below

In Github Desktop

Github Desktop is an independent application that allows you to manage all your repositories. When you open it, the interface allows you to choose the repository you want to work on, and then to perform basic Git actions from there.

46.6 Git + Github workflow

Process overview

Once you have completed the setup (described above), you will have a Github repo that is connected (cloned) to a local R project. The main branch (created by default) is the so-called “live” version of all the files. When you want to make modifications, it is a good practice to create a new branch from the main branch (like “Make a Copy”). This is a typical workflow in Git because creating a branch is easy and fast.

A typical workflow is as follows:

  1. Make sure that your local repository is up-to-date, update it if not

  2. Go to the branch you were working on previously, or create a new branch to try out some things

  3. Work on the files locally on your computer, make one or several commits to this branch

  4. Update the remote version of the branch with your changes (push)

  5. When you are satisfied with your branch, you can merge the online version of the working branch into the online “main” branch to transfer the changes

Other team members may be doing the same thing with their own branches, or perhaps contributing commits into your working branch as well.

We go through the above process step-by-step in more detail below. Here is a schematic we’ve developed - it’s in the format of a two-way table so it should help epidemiologists understand.

Here’s another diagram.

Note: until recently, the term “master” branch was used, but it is now referred to as “main” branch.

Image source

46.7 Create a new branch

When you select a branch to work on, Git resets your working directory to the way it was the last time you were on this branch.
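This behavior can be observed directly in a throwaway repository. In the sketch below (hypothetical file and branch names, Bash-style terminal assumed), a file committed on one branch disappears from the folder when switching away and reappears when switching back.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main"
git config user.name "Example User"
git config user.email "user@example.org"

echo "# main analysis" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

git checkout -q -b experiment              # create and switch to a new branch
echo "# trial model" > trial_model.R
git add trial_model.R
git commit -q -m "Try a new model"
ls                                         # analysis.R and trial_model.R

git checkout -q main
ls                                         # only analysis.R: the folder matches main again

git checkout -q experiment
ls                                         # trial_model.R is back
```

Nothing is lost when the file disappears: it is stored in the commit on the experiment branch.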

In Rstudio Git pane

Ensure you are in the “main” branch, and then click on the purple icon to create a new branch (see image above).

  • You will be prompted to name your branch with a one-word descriptive name (can use underscores if needed).
  • You will see that locally, you are still in the same R project, but you are no longer working on the “main” branch.
  • Once created, the new branch will also appear in the Github website as a branch.

You can visualize branches in the Git Pane in Rstudio after clicking on “History”

In Github Desktop

The process is very similar: you are prompted to give your branch a name. Afterwards, you will be prompted to “Publish your branch to Github” to make the new branch appear in the remote repo as well.

In console

What is actually happening behind the scenes is that you create a new branch with git branch, then go to the branch with git checkout (i.e. tell Git that your next commits will occur there). From your git repository:

git branch my-new-branch  # Create the new branch
git checkout my-new-branch # Go to the branch
git checkout -b my-new-branch # Both at once (shortcut)

For more information about using the console, see the section on Git commands at the end.

46.8 Commit changes

Now you can edit code, add new files, update datasets, etc.

Every one of your changes is tracked, once the respective file is saved. Changed files will appear in the RStudio Git tab, in Github Desktop, or using the command git status in the terminal (see below).

Whenever you make substantial changes (e.g. adding or updating a section of code), pause and commit those changes. Think of a commit as a “batch” of changes related to a common purpose. You can always continue to revise a file after having committed changes on it.

Advice on commits: generally, it is better to make small commits that can be easily reverted if a problem arises, and to commit together modifications related to a common purpose. To achieve this, you will find that you should commit often. At the beginning, you’ll probably forget to commit often, but then the habit kicks in.

In Rstudio

The example below shows that, since the last commit, the R Markdown script “collaboration.Rmd” has changed, and several PNG images were added.

You might be wondering what the yellow, blue, green, and red squares next to the file names represent. Here is a snapshot from the RStudio cheatsheet that explains their meaning. Note that changes with yellow “?” can still be staged, committed, and pushed.

  • Press the “Commit” button in the Git tab, which will open a new window (shown below)

  • Click on a file name in the upper-left box

  • Review the changes you made to that file (highlighted below in green or red)

  • “Stage” the file, which will include those changes in the commit. Do this by checking the box next to the file name. Alternatively, you can highlight multiple file names and then click “Stage”

  • Write a commit message that is short but descriptive (required)

  • Press the “Commit” button. A pop-up box will appear showing success or an error message.

Now you can make more changes and more commits, as many times as you would like

In Github Desktop

You can see the list of the files that were changed on the left. If you select a text file, you will see a summary of the modifications that were made in the right pane (the view will not work on more complex files like .docx or .xlsx).

To stage the changes, just tick the little box near file names. When you have selected the files you want to add to this commit, give the commit a name, optionally a description and then click on the commit button.

In console

The two functions used behind the scenes are git add to select/stage files and git commit to actually do the commit.

git status # see the changes 

git add new_pages/collaboration.Rmd  # select files to commit (= stage the changes)

git commit -m "Describe commit from Github Desktop" # commit the changes with a message

git log  # view information on past commits

Amend a previous commit

What happens if you commit some changes, carry on working, and realize that you made changes that should “belong” to the past commit (in your opinion)? Fear not! You can append these changes to your previous commit.

In Rstudio, it should be pretty obvious as there is an “Amend previous commit” box on the same line as the COMMIT button.

For some unclear reason, the functionality has not been implemented as such in Github Desktop, but there is a (conceptually awkward but easy) workaround. If you have committed but not pushed your changes yet, an “UNDO” button appears just under the COMMIT button. Click on it and it will revert your commit (but keep your staged files and your commit message). Save your changes, add new files to the commit if necessary and commit again.

In the console:

git add [YOUR FILES] # Stage your new changes

git commit --amend  # Amend the previous commit

git commit --amend -m "An updated commit message"  # Amend the previous commit AND update the commit message

Note: think before modifying commits that are already public and shared with your collaborators.
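The reason for this caution can be seen in a short sketch: amending does not edit the old commit, it replaces it with a new one that has a different ID. The example below uses a throwaway repository and hypothetical file names (Bash-style terminal assumed); --no-edit keeps the existing commit message.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Example User"
git config user.email "user@example.org"

echo "# import data" > import.R
git add import.R
git commit -q -m "Add import script"
hash_before="$(git rev-parse HEAD)"

# A forgotten change that belongs to the same commit
echo "# set column types" >> import.R
git add import.R
git commit -q --amend --no-edit       # fold it into the previous commit
hash_after="$(git rev-parse HEAD)"

git log --oneline                     # still a single commit
echo "$hash_before -> $hash_after"    # but its ID has changed
```

A collaborator who already pulled the original commit would still hold the old ID, which is why amending is safest before pushing.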

46.9 Pull and push changes up to Github

“First PULL, then PUSH”

It is good practice to fetch and pull before you begin working on your project, to update the branch version on your local computer with any changes that have been made to it in the remote/Github version.

PULL often. Don’t hesitate. Always pull before pushing.

When your changes are made and committed and you are happy with the state of your project, you can push your commits up to the remote/Github version of your branch.

Rinse and repeat while you are working on the repository.

Note: it is much easier to revert changes that were committed but not pushed (i.e. are still local) than to revert changes that were pushed to the remote repository (and perhaps already pulled by someone else), so it is better to push when you are done with introducing changes on the task that you were working on.

In Rstudio

PULL - First, click the “Pull” icon (downward arrow) which fetches and pulls at the same time.

PUSH - Click the green “Push” icon (upward arrow). You may be asked to enter your Github username and password. The first time you are asked, you may need to enter two Git command lines into the Terminal:

  • git config --global user.email "your.email@example.com" (replace with your Github email address), and
  • git config --global user.name "Your Github username"

To learn more about how to enter these commands, see the section below on Git commands.

TIP: Asked to provide your password too often? See chapters 10 & 11 of this tutorial to connect to a repository using an SSH key (more complicated)

In Github Desktop

Click on the “Fetch origin” button to check if there are new commits on the remote repository.

If Git finds new commits on the remote repository, the button will change into a “Pull” button. Because the same button is used to push and pull, you cannot push your changes if you don’t pull before.

You can go to the “History” tab (near the “Changes” tab) to see all commits (yours and others). This is a nice way of acquainting yourself with what your collaborators did. You can read the commit message, the description if there is one, and compare the code of the two files using the diff pane.

Once all remote changes have been pulled, and at least one local change has been committed, you can push by clicking on the same button.

Console

Unsurprisingly, the commands are fetch, pull and push.

git fetch  # Are there new commits in the remote repository?
git pull   # Bring remote commits into your local branch
git push   # Push local commits of this branch to the remote branch

I want to pull but I have local work

This can happen sometimes: you made some changes on your local repository, but the remote repository has commits that you didn’t pull.

Git will refuse to pull because it might overwrite your changes. There are several strategies to keep your changes, well described in Happy Git with R, among which the two main ones are:

  • commit your changes, fetch remote changes, pull them in, resolve conflicts if needed (see section below), and push everything online
  • stash your changes, which sort of stores them aside, pull, unstash (restore), and then commit, solve any conflicts, and push.

If the files concerned by the remote changes and the files concerned by your local changes do not overlap, Git may solve conflicts automatically.

In Github Desktop, this can be done with buttons. To stash, go to Branch > Stash all changes.
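In the console, the stash strategy can be sketched as below. A local bare repository stands in for Github, and mina and henry are hypothetical collaborators; the two edits touch different parts of the same file so that restoring the stash applies cleanly (Bash-style terminal assumed).

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q --bare remote.git                                # stand-in for Github
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main

git clone -q remote.git mina 2>/dev/null                     # empty-repo warning silenced
cd mina
git symbolic-ref HEAD refs/heads/main
git config user.name "Mina"
git config user.email "mina@example.org"
printf '%s\n' "# step 1" "# step 2" "# step 3" "# step 4" "# step 5" > cleaning.R
git add cleaning.R
git commit -q -m "Add cleaning script"
git push -q -u origin main
cd ..

# Henry revises the first line and pushes
git clone -q remote.git henry
cd henry
git config user.name "Henry"
git config user.email "henry@example.org"
printf '%s\n' "# step 1 (revised)" "# step 2" "# step 3" "# step 4" "# step 5" > cleaning.R
git commit -q -a -m "Revise step 1"
git push -q
cd ../mina

# Mina has uncommitted work at the end of the same file
echo "# step 6 (in progress)" >> cleaning.R
git pull -q --ff-only || echo "Pull refused: local changes would be overwritten"

git stash                # set the local changes aside
git pull -q --ff-only    # the pull now succeeds
git stash pop            # restore the local changes on top
cat cleaning.R           # Henry's revision and Mina's new line are both present
```

If the stashed changes and the pulled changes had touched the same lines, git stash pop would report a conflict to resolve by hand.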

46.10 Merge branch into Main

If you have finished making changes, you can begin the process of merging those changes into the main branch. Depending on your situation, this may be fast, or you may have deliberate review and approval steps involving teammates.

Locally in Github Desktop

One can merge branches locally using Github Desktop. First, go to (checkout) the branch that will be the recipient of the commits, in other words, the branch you want to update. Then go to the menu Branch > Merge into current branch and click. A box will allow you to select the branch you want to import from.

In console

First move back to the branch that will be the recipient of the changes. This is usually main (named master in older repositories, as in the example below), but it could be another branch. Then merge your working branch into it.

git checkout master  # Go back to master (or to the branch that should receive the changes)
git merge this_fancy_new_branch

This page shows a more advanced example of branching and explains a bit what is happening behind the scenes.
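For a complete, runnable version of a local merge, here is a sketch in a throwaway repository where the receiving branch is called main. File names are hypothetical; --no-edit accepts Git's default merge message instead of opening an editor (Bash-style terminal assumed).

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main"
git config user.name "Example User"
git config user.email "user@example.org"

echo "# main analysis" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Work on a separate branch
git checkout -q -b this_fancy_new_branch
echo "# forest plot" > forest_plot.R
git add forest_plot.R
git commit -q -m "Add forest plot"

# Meanwhile, main also moves on
git checkout -q main
echo "# Project notes" > README.md
git add README.md
git commit -q -m "Add README"

# Merge the working branch into the current branch (main)
git merge -q --no-edit this_fancy_new_branch
ls                               # analysis.R, forest_plot.R and README.md
git log --oneline --graph        # the graph shows the two lines of work joining
```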

In Github: submitting pull requests

While it is totally possible to merge two branches locally, or without informing anybody, a merge may be discussed or investigated by several people before being integrated into the main branch. To help with the process, Github offers some discussion features around the merge: the pull request.

A pull request (a “PR”) is a request to merge one branch into another (in other words, a request that your working branch be pulled into the “main” branch). A pull request typically involves multiple commits. A pull request usually begins a conversation and review process before it is accepted and the branch is merged. For example, you can read pull request discussions on dplyr’s github.

You can submit a pull request (PR) directly from the website (as illustrated below) or from Github Desktop.

  • Go to Github repository (online)
  • View the tab “Pull Requests” and click the “New pull request” button
  • Select from the drop-down menu to merge your branch into main
  • Write a detailed Pull Request comment and click “Create Pull Request”.

In the image below, the branch “forests” has been selected to be merged into “main”:

Now you should be able to see the pull request (example image below):

  • Review the tab “Files changed” to see how the “main” branch would change if the branch were merged.
  • On the right, you can request a review from members of your team by tagging their Github ID. If you like, you can set the repository settings to require one approving review in order to merge into main.
  • Once the pull request is approved, a button to “Merge pull request” will become active. Click this.
  • Once completed, delete your branch as explained below.

Resolving conflicts

When two people modified the same line(s) at the same time, a merge conflict arises. Indeed, Git refuses to make a decision about which version to keep, but it helps you find where the conflict is. DO NOT PANIC. Most of the time, it is pretty straightforward to resolve.

For example, on Github:

After the merge raised a conflict, open the file in your favorite editor. The conflict will be indicated by a series of marker characters:

The text between <<<<<<< HEAD and ======= comes from your local repository, and the text between ======= and >>>>>>> comes from the other branch (which may be origin, master or any branch of your choice).

You need to decide which version of the code you prefer (or even write a third, including changes from both sides if pertinent), delete the rest and remove all the marks that Git added (<<<<<<< HEAD, =======, >>>>>>> origin/master/your_branch_name).

Then, save the file, stage it and commit it : this is the commit that makes the merged version “official”. Do not forget to push afterwards.
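To see these markers first-hand, the sketch below deliberately creates a small conflict in a throwaway repository and then resolves it. The file, the values and the branch name colleague_branch are hypothetical (Bash-style terminal assumed).

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main"
git config user.name "Example User"
git config user.email "user@example.org"

echo "threshold <- 5" > params.R
git add params.R
git commit -q -m "Set threshold"

# The same line is changed on a branch...
git checkout -q -b colleague_branch
echo "threshold <- 10" > params.R
git commit -q -a -m "Raise threshold to 10"

# ...and, differently, on main
git checkout -q main
echo "threshold <- 7" > params.R
git commit -q -a -m "Raise threshold to 7"

git merge colleague_branch     # Git reports a conflict and stops
cat params.R                   # the file now contains the conflict markers

# Resolve: keep one version, with all markers removed
echo "threshold <- 10" > params.R
git add params.R
git commit -q -m "Merge colleague_branch, keep threshold of 10"
git status --short             # nothing left to resolve
```

With default Git settings, cat shows the line from main between <<<<<<< HEAD and =======, and the line from colleague_branch between ======= and >>>>>>> colleague_branch.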

The more often you and your collaborators pull and push, the smaller the conflicts will be.

Note: If you feel at ease with the console, there are more advanced merging options (e.g. ignoring whitespace, giving a collaborator priority etc.).

Delete your branch

Once a branch has been merged into main and is no longer needed, you can delete it.

Github + Rstudio

Go to the repository on Github and click the button to view all the branches (next to the drop-down to select branches). Now find your branch and click the trash icon next to it. Read more detail on deleting a branch here.

Be sure to also delete the branch locally on your computer. This will not happen automatically.

  • From RStudio, make sure you are in the main branch
  • Switch to typing Git commands in the RStudio “Terminal” (the tab adjacent to the R console), and type: git branch -d branch_name, where “branch_name” is the name of your branch to be deleted
  • Refresh your Git tab and the branch should be gone
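As a small sketch of how git branch -d behaves, the example below deletes a merged branch and shows that Git refuses to delete a branch whose commits have not been merged. Branch and file names are hypothetical, in a throwaway repository (Bash-style terminal assumed).

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main      # name the first branch "main"
git config user.name "Example User"
git config user.email "user@example.org"
echo "# main analysis" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# A branch that gets merged into main
git checkout -q -b finished_branch
echo "# summary table" > table.R
git add table.R
git commit -q -m "Add summary table"
git checkout -q main
git merge -q finished_branch

# A branch with work that was never merged
git checkout -q -b unfinished_branch
echo "# draft" > draft.R
git add draft.R
git commit -q -m "Start draft"
git checkout -q main

git branch -d finished_branch       # merged, so it is deleted
git branch -d unfinished_branch || echo "Kept: this branch has unmerged commits"
git branch --list                   # main and unfinished_branch remain
```

This refusal is a safety net. Git also offers an uppercase -D option that deletes a branch even when it is unmerged, discarding that work from the branch list.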

In Github Desktop

Just check out the branch you want to delete, and go to the menu Branch > Delete.

Forking

You can fork a project if you would like to contribute to it but do not have the rights to do so, or if you just want to modify it for your personal use. A short description of forking can be found here.

On Github, click on the “Fork” button:

This will clone the original repository, but in your own profile. So now, there are two versions of the repository on Github: the original one, that you cannot modify, and the cloned version in your profile.

Then, you can proceed to clone your version of the online repository locally on your computer, using any of the methods described in previous sections. Then, you can create a new branch, make changes, commit and push them to your remote repository.

Once you are happy with the result you can create a Pull Request from Github or Github Desktop to begin the conversation with the owners/maintainers of the original repository.

What if you need some newer commits from the official repository?

Imagine that someone makes a critical modification to the official repository, which you want to include in your cloned version. It is possible to synchronize your fork with the official repository. It involves using the terminal, but it is not too complicated. You mostly need to remember that:

  • upstream = the official repository, the one that you could not modify
  • origin = your version of the repository on your Github profile

You can read this tutorial or follow along below:

First, type in your Git terminal (inside your repo):

git remote -v

If you have not yet configured the upstream repository you should see two lines, beginning with origin. They show the remote repo that fetch and push point to. Remember, origin is the conventional nickname for your own version of the repository on Github. For example:

Now, add a new remote repository:

git remote add upstream https://github.com/appliedepi/epirhandbook_eng.git

Here the address is the address that Github generates when you clone a repository (see section on cloning). Now you will have four remote pointers:

Now that the setup is done, whenever you want to get the changes from the original (upstream) repository, you just have to go (checkout) to the branch you want to update and type:

git fetch upstream # Get the new commits from the remote repository
git checkout the_branch_you_want_to_update
git merge upstream/the_branch_you_want_to_update  # Merge the upstream branch into your branch.
git push # Update your own version of the remote repo

If there are conflicts, you will have to solve them, as explained in the Resolving conflicts section.

Summary: forking is cloning, but on the Github server side. The rest of the actions are typical collaboration workflow actions (clone, push, pull, commit, merge, submit pull requests…).

Note: while forking is a concept, not a Git command, it also exists on other Web hosts, like Bitbucket.

46.11 What we learned

You have learned how to:

  • set up Git to track modifications in your folders,
  • connect your local repository to a remote online repository,
  • commit changes,
  • synchronize your local and remote repositories.

All this should get you going and be enough for most of your needs as epidemiologists. We usually do not have as advanced usage as developers.

However, know that should you want (or need) to go further, Git offers more power to simplify commit histories, revert one or several commits, cherry-pick commits, etc. Some of it may sound like pure wizardry, but now that you have the basics, it is easier to build on it.

Note that while the Git pane in Rstudio and Github Desktop are good for beginners / day-to-day usage in our line of work, they do not offer an interface to some of the intermediate / advanced Git functions. Some more complete interfaces allow you to do more with point-and-click (usually at the cost of a more complex layout).

Remember that since you can use any tool at any point to track your repository, you can very easily install an interface to try it out sometimes, or to perform some less common complex task occasionally, while preferring a simplified interface for the rest of the time (e.g. using Github Desktop most of the time, and switching to SourceTree or Git Bash for some specific tasks).

46.12 Git commands

Where to enter commands

You enter commands in a Git shell.

Option 1 You can open a new Terminal in RStudio. This tab is next to the R Console. If you cannot type any text in it, click on the drop-down menu below “Terminal” and select “New terminal”. Type the commands at the blinking space in front of the dollar sign “$”.

Option 2 You can also open a shell (a terminal to enter commands) by clicking the blue “gears” icon in the Git tab (near the RStudio Environment). Select “Shell” from the drop-down menu. A new window will open where you can type the commands after the dollar sign “$”.

Option 3 Right click to open “Git Bash here” which will open the same sort of terminal, or open Git Bash from your application list. More beginner-friendly information on Git Bash, how to find it and some bash commands you will need.

Sample commands

Below we present a few common git commands. When you use them, keep in mind which branch is active (checked-out), as that will change the action!

In the commands below, <name> represents a branch name. <commit_hash> represents the hash ID of a specific commit. <num> represents a number. Do not type the < or > symbols.

Git command                Action
git branch <name>          Create a new branch with the name <name>
git checkout <name>        Switch current branch to <name>
git checkout -b <name>     Shortcut to create new branch <name> and switch to it
git status                 See the state of your working directory (staged, unstaged and untracked changes)
git add <file>             Stage a file
git commit -m <message>    Commit currently staged changes to current branch with message <message>
git fetch                  Fetch commits from remote repository
git pull                   Pull commits from remote repository in current branch
git push                   Push local commits to remote repository
git switch                 An alternative to git checkout that is being phased in to Git
git merge <name>           Merge branch <name> into current branch
git rebase <name>          Replay commits from current branch on top of branch <name>

46.13 Resources

Much of this page was informed by this “Happy Git with R” website by Jenny Bryan. There is a very helpful section of this website that helps you troubleshoot common Git and R-related errors.

The Github.com documentation and start guide.

The RStudio “IDE” cheatsheet which includes tips on Git with RStudio.

https://ohi-science.org/news/github-going-back-in-time

Git commands for beginners

An interactive tutorial to learn Git commands.

https://www.freecodecamp.org/news/an-introduction-to-git-for-absolute-beginners-86fa1d32ff71/: good for learning the absolute basics to track changes in one folder on your own computer.

Nice schematics to understand branches: https://speakerdeck.com/alicebartlett/git-for-humans

Tutorials covering both basic and more advanced subjects

https://tutorialzine.com/2016/06/learn-git-in-30-minutes

https://dzone.com/articles/git-tutorial-commands-and-operations-in-git

https://swcarpentry.github.io/git-novice/ (short course)

https://rsjakob.gitbooks.io/git/content/chapter1.html

The Pro Git book is considered an official reference. While some chapters are ok, it is usually a bit technical. It is probably a good resource once you have used Git a bit and want to learn a bit more precisely what happens and how to go further.

47 Common errors

This page includes a running list of common errors and suggests solutions for troubleshooting them.

47.1 Interpreting error messages

R errors can be cryptic at times, so Google is your friend. Search the error message with “R” and look for recent posts in StackExchange.com, stackoverflow.com, community.rstudio.com, Twitter (#rstats), and other forums used by programmers to file questions and answers. Try to find recent posts that have solved similar problems.

If after much searching you cannot find an answer to your problem, consider creating a reproducible example (“reprex”) and posting the question yourself. See the page on Getting help for tips on how to create and post a reproducible example to forums.

47.2 Common errors

Below, we list some common errors and potential explanations/solutions. Some of these are borrowed from Noam Ross who analyzed the most common forum posts on Stack Overflow about R error messages (see analysis here)

Typo errors

Error: unexpected symbol in:
"  geom_histogram(stat = "identity")+
  tidyquant::geom_ma(n=7, size = 2, color = "red" lty"

If you see “unexpected symbol”, check for missing commas. In the message above, a comma is missing between color = "red" and lty.
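As a minimal sketch of how a missing comma produces this message, the code below parses two short strings with base R's parse(). The strings are invented for illustration and are not from the handbook data.

```r
# Parse code held in a string so that the syntax error can be caught and inspected
bad_code  <- "c(a = 1 b = 2)"    # comma missing between the two arguments
good_code <- "c(a = 1, b = 2)"   # corrected version

msg <- tryCatch(
  {
    parse(text = bad_code)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(msg)                              # the message mentions "unexpected symbol"

fixed <- eval(parse(text = good_code))  # the corrected code parses and runs
print(fixed)
```

The position reported in the message (line and column) points to the first token R could not interpret, which is usually just after the missing comma.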

Package errors

could not find function "x"...

This likely means that you typed the function name incorrectly, or forgot to install or load a package.

Error in select(data, var) : unused argument (var)

You think you are using dplyr::select() but the select() function has been masked by MASS::select() - specify dplyr:: or re-order your package loading so that dplyr is after all the others.

Other common masking errors stem from: plyr::summarise() and stats::filter(). Consider using the conflicted package.
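If you suspect masking, one way to check is to ask R which package an unqualified function name currently resolves to. The sketch below uses only base R; where_from() is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper: report the package whose function is found first for a given name
where_from <- function(fn_name) {
  fn <- get(fn_name, mode = "function")   # the function R would call for this name
  environmentName(environment(fn))        # name of the namespace it comes from
}

where_from("sd")        # "stats"
where_from("filter")    # "stats" in a fresh session; a different package if filter() is masked

# find() lists every attached package that contains an object with this name
find("filter")

# Calling a function with an explicit namespace avoids the ambiguity
stats::sd(c(1, 2, 3))
```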

Error in install.packages : ERROR: failed to lock directory ‘C:\Users\Name\Documents\R\win-library\4.0’ for modifying
Try removing ‘C:\Users\Name\Documents\R\win-library\4.0/00LOCK’

If you get an error saying you need to remove an “00LOCK” file, go to your “R” library in your computer directory (e.g. R/win-library/) and look for a folder called “00LOCK”. Delete this manually, and try installing the package again. A previous install process was probably interrupted, which led to this.

Object errors

No such file or directory:

If you see an error like this when you try to export or import: Check the spelling of the file and filepath, and if the path contains slashes make sure they are forward / and not backward \. Also make sure you used the correct file extension (e.g. .csv, .xlsx).

object 'x' not found 

This means that an object you are referencing does not exist. Perhaps code above did not run properly?

Error in 'x': subscript out of bounds

This means you tried to access something (an element of a vector or a list) that was not there.
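A small base R sketch of when this error appears, using an invented list. Note that single brackets on a vector return NA for a missing position, whereas double brackets raise the error.

```r
x <- list(a = 1, b = 2)   # invented example list with two elements

# Single brackets on a vector: a missing position returns NA, with no error
c(10, 20)[5]

# Double brackets on a missing position: "subscript out of bounds"
msg <- tryCatch(x[[3]], error = function(e) conditionMessage(e))
print(msg)

# A defensive check before indexing
i <- 3
value <- if (i <= length(x)) x[[i]] else NA
print(value)
```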

Function syntax errors

# ran recode without re-stating the x variable in mutate(x = recode(x, OLD = NEW))
Error: Problem with `mutate()` input `hospital`.
x argument ".x" is missing, with no default
i Input `hospital` is `recode(...)`.

This error above (argument .x is missing, with no default) is common in mutate() if you are supplying a function like recode() or replace_na() where it expects you to provide the column name as the first argument. This is easy to forget.

Logic errors

Error in if

This likely means an if statement was applied to something that was not TRUE or FALSE.
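A common cause is a missing value in the condition. The sketch below, with an invented variable, shows the error and one possible safeguard using base R's isTRUE().

```r
temp <- NA   # invented value, e.g. a missing temperature measurement

# if () needs exactly one TRUE or FALSE; a comparison with NA returns NA and triggers an error
msg <- tryCatch(
  if (temp > 38) "fever" else "no fever",
  error = function(e) conditionMessage(e)
)
print(msg)

# isTRUE() returns FALSE for NA, so the condition can always be evaluated
status <- if (isTRUE(temp > 38)) "fever" else "no fever or unknown"
print(status)
```

Whether NA should be treated as FALSE depends on your analysis, so consider handling missing values explicitly instead.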

Factor errors

#Tried to add a value ("Missing") to a factor (with replace_na operating on a factor)
Problem with `mutate()` input `age_cat`.
i invalid factor level, NA generated
i Input `age_cat` is `replace_na(age_cat, "Missing")`.invalid factor level, NA generated

If you see this error about invalid factor levels, you likely have a column of class Factor (which contains pre-defined levels) and tried to add a new value to it. Convert it to class Character before adding a new value.
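The sketch below shows two possible base R fixes on an invented factor: converting to character first, or adding the new level before assigning it.

```r
age_cat <- factor(c("0-4", "5-9", NA))   # invented example values

# Option 1: convert to character, replace NA, then convert back to factor
age_chr <- as.character(age_cat)
age_chr[is.na(age_chr)] <- "Missing"
age_fixed <- factor(age_chr)

# Option 2: add the new level to the factor first, then assign it
age_fixed2 <- age_cat
levels(age_fixed2) <- c(levels(age_fixed2), "Missing")
age_fixed2[is.na(age_fixed2)] <- "Missing"

levels(age_fixed)
levels(age_fixed2)
```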

Plotting errors

Error: Insufficient values in manual scale. 3 needed but only 2 provided.

In ggplot(), the values supplied to scale_fill_manual(), e.g. values = c(“orange”, “purple”), are insufficient for the number of factor levels. Consider whether NA is now a factor level.

Can't add x object

You probably have an extra + at the end of a ggplot command that you need to delete.

R Markdown errors

If the error message contains something like Error in options[[sprintf("fig.%s", i)]], check that your knitr options at the top of each chunk correctly use the out.width = or out.height = and not fig.width= and fig.height=.

Miscellaneous

Consider whether you re-arranged piped dplyr verbs and didn’t replace a pipe in the middle, or didn’t remove a pipe from the end after re-arranging.

47.3 Resources

This is another blog post that lists common R programming errors faced by beginners

48 Getting help

This page covers how to get help by posting a Github issue or by posting a reproducible example (“reprex”) to an online forum.

48.1 Github issues

Many R packages and projects have their code hosted on the website Github.com. You can communicate directly with authors via this website by posting an “Issue”.

Read more about how to store your work on Github in the page [Collaboration and Github].

On Github, each project is contained within a repository. Each repository contains code, data, outputs, help documentation, etc. There is also a vehicle to communicate with the authors called “Issues”.

See below the Github page for the incidence2 package (used to make epidemic curves). You can see the “Issues” tab highlighted in yellow. You can see that there are 5 open issues.

Once in the Issues tab, you can see the open issues. Review them to ensure your problem is not already being addressed. You can open a new issue by clicking the green button on the right. You will need a Github account to do this.

In your issue, follow the instructions below to provide a minimal, reproducible example. And please be courteous! Most people developing R packages and projects are doing so in their spare time (like this handbook!).

To read more advanced materials about handling issues in your own Github repository, check out the Github documentation on Issues.

48.2 Reproducible example

Providing a reproducible example (“reprex”) is key to getting help when posting in a forum or in a Github issue. People want to help you, but you have to give them an example that they can work with on their own computer. The example should:

  • Demonstrate the problem you encountered
  • Be minimal, in that it includes only the data and code required to reproduce your problem
  • Be reproducible, such that all objects (e.g. data), package calls (e.g. library() or p_load()) are included

Also, be sure you do not post any sensitive data with the reprex! You can create example data frames, or use one of the data frames built into R (enter data() to open a list of these datasets).

The reprex package

The reprex package can assist you with making a reproducible example:

  1. reprex is installed with tidyverse, so load either package
# install/load tidyverse (which includes reprex)
pacman::p_load(tidyverse)
  2. Begin an R script that creates your problem, step-by-step, starting from loading packages and data.
# load packages
pacman::p_load(
     tidyverse,  # data mgmt and visualization
     outbreaks)  # example outbreak datasets

# flu epidemic case linelist
outbreak_raw <- outbreaks::fluH7N9_china_2013  # retrieve dataset from outbreaks package

# Clean dataset
outbreak <- outbreak_raw %>% 
     mutate(across(contains("date"), as.Date))

# Plot epidemic

ggplot(data = outbreak)+
     geom_histogram(
          mapping = aes(x = date_of_onset),
          binwidth = 7
     )+
  scale_x_date(
    date_format = "%d %m"
  )

Copy all the code to your clipboard, and run the following command:

reprex::reprex()

You will see an HTML output appear in the RStudio Viewer pane. It will contain all your code and any warnings, errors, or plot outputs. This output is also copied to your clipboard, so you can post it directly into a Github issue or a forum post.

  • If you set session_info = TRUE the output of sessioninfo::session_info() with your R and R package versions will be included
  • You can provide a working directory to wd =
  • You can read more about the arguments and possible variations at the documentation or by entering ?reprex

In the example above, the ggplot() command did not run because the argument date_format = is not correct - it should be date_labels =.

Minimal data

The helpers need to be able to use your data - ideally they need to be able to create it with code.

To create a minimal dataset, consider anonymising and using only a subset of the observations.

You can also use the function dput() to create a minimal dataset.
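As a sketch of this approach, the code below builds a small invented data frame and prints the R code that recreates it. The column names and values are made up for illustration.

```r
# Small, anonymised example data (invented values)
demo <- data.frame(
  case_id = c("a1", "b2", "c3"),
  age     = c(12, 45, NA),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; paste that code into your post
dput(demo)

# Round trip: capture the printed code and rebuild the data frame from it
code    <- paste(capture.output(dput(demo)), collapse = "\n")
rebuilt <- eval(parse(text = code))
identical(demo, rebuilt)
```

A helper can then paste the dput() output into their own session and assign it to an object to get the same data.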

48.3 Posting to a forum

Read lots of forum posts. Get an understanding for which posts are well-written, and which ones are not.

  1. First, decide whether to ask the question at all. Have you thoroughly reviewed the forum website, trying various search terms, to see if your question has already been asked?

  2. Give your question an informative title (not “Help! this isn’t working”).

  3. Write your question:

  • Introduce your situation and problem
  • Link to posts of similar issues and explain how they do not answer your question
  • Include any relevant information to help someone who does not know the context of your work
  • Give a minimal reproducible example with your R session information
  • Use proper spelling, grammar, punctuation, and break your question into paragraphs so that it is easier to read
  4. Monitor your question once posted to respond to any requests for clarification. Be courteous and gracious - often the people answering are volunteering their time to help you. If you have a follow-up question consider whether it should be a separate posted question.

  5. Mark the question as answered, if you get an answer that meets the original request. This helps others later quickly recognize the solution.

Read these posts about how to ask a good question, and the Stack Overflow code of conduct.

48.4 Resources

Tidyverse page on how to get help!

Tips on producing a minimal dataset

Documentation for the dput function

49 R on network drives

49.1 Overview

Using R on network or “company” shared drives can present additional challenges. This page contains approaches, common errors, and suggestions on troubleshooting gained from our experience working through these issues. These include tips for the particularly delicate situations involving R Markdown.

Using R on Network Drives: Overarching principles

  1. You must get administrator access for your computer. Set up RStudio specifically to run as administrator.
  2. Save packages to a library on a lettered drive (e.g. “C:”) when possible. Use a package library whose path begins with "\\" as little as possible.
  3. The rmarkdown package must not be in a "\\" package library, as then it can’t connect to TinyTex or Pandoc.

49.2 RStudio as administrator

When you click the RStudio icon to open RStudio, do so with a right-click. Depending on your machine, you may see an option to “Run as Administrator”. Otherwise, you may see an option to select Properties (then there should appear a window with the option “Compatibility”, and you can select a checkbox “Run as Administrator”).

49.3 Useful commands

Below are some useful commands when trying to troubleshoot issues using R on network drives.

You can return the path(s) to package libraries that R is using. They will be listed in the order that R is using to install/load/search for packages. Thus, if you want R to use a different default library, you can switch the order of these paths (see below).

# Find libraries
.libPaths()                   # Your library paths, listed in order that R installs/searches. 
                              # Note: all libraries will be listed, but to install to some (e.g. C:) you 
                              # may need to be running RStudio as an administrator (it won't appear in the 
                              # install packages library drop-down menu) 

You may want to switch the order of the package libraries used by R, for example if R is picking up a library location that begins with “\\” and one that begins with a letter, e.g. “D:”. You can adjust the order of .libPaths() with the following code.

# Switch order of libraries
# this can affect the priority of R finding a package. E.g. you may want your C: library to be listed first
myPaths <- .libPaths() # get the paths
myPaths <- c(myPaths[2], myPaths[-2]) # move the second library to the front, keeping any others (assumes at least two libraries)
.libPaths(myPaths) # reassign them

If you are having difficulties with R Markdown connecting to Pandoc, begin with this code to find out where RStudio thinks your Pandoc installation is.

# Find Pandoc
Sys.getenv("RSTUDIO_PANDOC")  # Find where RStudio thinks your Pandoc installation is

If you want to see which library a package is loading from, try the below code:

# Find a package
# gives first location of package (note order of your libraries)
find.package("rmarkdown", lib.loc = NULL, quiet = FALSE, verbose = getOption("verbose")) 

49.4 Troubleshooting common errors

“Failed to compile…tex in rmarkdown”

  • Check the installation of TinyTex, or install TinyTex to C: location. See the R basics page on how to install TinyTex.
# check/install tinytex, to C: location
tinytex::install_tinytex()
tinytex:::is_tinytex() # should return TRUE (note three colons)

Internet routines cannot be loaded

For example, Error in tools::startDynamicHelp() : internet routines cannot be loaded

  • Try selecting 32-bit version from RStudio via Tools/Global Options.
    • note: if 32-bit version does not appear in menu, make sure you are not using RStudio v1.2.
  • Alternatively, try uninstalling R and re-installing with different bit version (32 instead of 64)

C: library does not appear as an option when I try to install packages manually

  • Run RStudio as an administrator, then this option will appear.
  • To set up RStudio to always run as administrator (advantageous when using an Rproject where you don’t click the RStudio icon to open), right-click the RStudio icon and select Properties (see the section RStudio as administrator above)

The image below shows how you can manually select the library to install a package to. This window appears when you open the Packages RStudio pane and click “Install”.

Pandoc 1 error

If you are getting “pandoc error 1” when knitting R Markdown scripts on network drives:

Pandoc Error 83

The error will look something like this: can't find file...rmarkdown...lua.... This means that it was unable to find this file.

See https://stackoverflow.com/questions/58830927/rmarkdown-unable-to-locate-lua-filter-when-knitting-to-word

Possibilities:

  1. Rmarkdown package is not installed
  2. Rmarkdown package is not findable
  3. An admin rights issue.

It is possible that R is not able to find the rmarkdown package file, so check which library the rmarkdown package lives in (see code above). If the package is installed to a library that is inaccessible (e.g. starts with "\\") consider manually moving it to C: or other named drive library. Be aware that the rmarkdown package has to be able to connect to the TinyTex installation, so it cannot live in a library on a network drive.

Pandoc Error 61

For example: Error: pandoc document conversion failed with error 61 or Could not fetch...

  • Try running RStudio as administrator (right click icon, select run as admin, see above instructions)
  • Also see if the specific package that was unable to be reached can be moved to C: library.

LaTeX error (see below)

An error like: ! Package pdftex.def Error: File 'cict_qm2_2020-06-29_files/figure-latex/unnamed-chunk-5-1.png' not found: using draft setting. or Error: LaTeX failed to compile file_name.tex.

Pandoc Error 127

This could be a RAM (space) issue. Re-start your R session and try again.

Mapping network drives

Mapping a network drive can be risky. Consult with your IT department before attempting this.

A tip borrowed from this forum discussion:

How does one open a file “through a mapped network drive”?

  • First, you’ll need to know the network location you’re trying to access.
  • Next, in the Windows file manager, you will need to right click on “This PC” on the right hand pane, and select “Map a network drive”.
  • Go through the dialogue to define the network location from earlier as a lettered drive.
  • Now you have two ways to get to the file you’re opening. Using the drive-letter path should work.

Error in install.packages()

If you get an error that includes mention of a “lock” directory, for example: Error in install.packages : ERROR: failed to lock directory...

Look in your package library and you will see a folder whose name begins with “00LOCK”. Try the following tips:

  • Manually delete the “00LOCK” folder directory from your package library. Try installing the package again.
  • You can also try the command pacman::p_unlock() (you can also put this command in the Rprofile so it runs every time the project opens). Then try installing the package again. It may take several tries.
  • Try running RStudio in Administrator mode, and try installing the packages one-by-one.
  • If all else fails, install the package to another library or folder (e.g. Temp) and then manually copy the package’s folder over to the desired library.

50 Data Table

The handbook focusses on the dplyr “verb” functions and the magrittr pipe operator %>% as a method to clean and group data, but the data.table package offers an alternative method that you may encounter in your R career.

50.1 Intro to data tables

A data table is a 2-dimensional data structure like a data frame that allows complex grouping operations to be performed. The data.table syntax is structured so that operations can be performed on rows, columns and groups.

The structure is DT[i, j, by], made up of 3 parts: the i, j and by arguments. The i argument allows for subsetting of required rows, the j argument allows you to operate on columns and the by argument allows you to operate on columns by groups.
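To make the three parts concrete, here is a minimal sketch on an invented toy data table (not the handbook linelist). The column names and values are assumptions for illustration only.

```r
library(data.table)

# Invented toy data
dt <- data.table(
  hospital = c("A", "A", "B", "B", "B"),
  age      = c(10, 30, 50, 20, 40)
)

# i:  keep rows where age is at least 20
# j:  compute the mean age
# by: do so separately for each hospital
res <- dt[age >= 20, .(mean_age = mean(age)), by = hospital]
print(res)
```

Each part is optional: dt[age >= 20] uses only i, and dt[, .(mean_age = mean(age))] uses only j.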

This page will address the following topics:

  • Importing data and use of fread() and fwrite()
  • Selecting and filtering rows using the i argument
  • Using helper functions %like%, %chin%, %between%
  • Selecting and computing on columns using the j argument
  • Computing by groups using the by argument
  • Adding and updating data to data tables using :=

50.2 Load packages and import data

Load packages

Using the p_load() function from pacman, we load (and install if necessary) packages required for this analysis.

pacman::p_load(
  rio,        # to import data
  data.table, # to group and clean data
  tidyverse,  # allows use of pipe (%>%) function in this chapter
  here 
  ) 

Import data

This page will explore some of the core functions of data.table using the case linelist referenced throughout the handbook.

We import the dataset of cases from a simulated Ebola epidemic. If you want to download the data to follow step-by-step, see instructions in the Download handbook and data page. The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data. From here we use data.table() to convert the data frame to a data table.

linelist <- rio::import(here("data", "linelist_cleaned.xlsx")) %>% data.table()

The fread() function is used to import regular delimited files, such as .csv files, directly to a data table format. This function, and its counterpart fwrite(), used for writing data tables as regular delimited files, are very fast and computationally efficient options for large databases.
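A minimal sketch of a round trip with fwrite() and fread(), using an invented data table and a temporary file so that nothing is written to your project folder:

```r
library(data.table)

dt <- data.table(case_id = c("a1", "b2"), age = c(12L, 45L))  # invented data

tmp <- tempfile(fileext = ".csv")   # temporary file, used only for illustration
fwrite(dt, tmp)                     # write the data table as a .csv file
dt2 <- fread(tmp)                   # read it back, directly as a data table

class(dt2)                          # includes "data.table"
unlink(tmp)                         # remove the temporary file
```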

The first 20 rows of linelist:

Base R commands such as dim() that are used for data frames can also be used for data tables

dim(linelist) #gives the number of rows and columns in the data table
## [1] 5888   30

50.3 The i argument: selecting and filtering rows

Recalling the DT[i, j, by] structure, we can filter rows using either row numbers or logical expressions. The i argument is first; therefore, the syntax DT[i] or DT[i,] can be used.

The first example retrieves the first 5 rows of the data table, the second example subsets cases that are 18 years old or over, and the third example subsets cases that are 18 years old or over but not diagnosed at the Central Hospital:

linelist[1:5] #returns the 1st to 5th row
linelist[age >= 18] #subsets cases that are 18 years old or over
linelist[age >= 18 & hospital != "Central Hospital"] #subsets cases that are 18 years old or over but not diagnosed at the Central Hospital

Using .N in the i argument represents the total number of rows in the data table. This can be used to subset on the row numbers:

linelist[.N] #returns the last row
linelist[15:.N] #returns the 15th to the last row

Using helper functions for filtering

Data table uses helper functions that make subsetting rows easy. The %like% function is used to match a pattern in a column, %chin% is used to match specific character values (like %in%, but optimised for character vectors), and the %between% helper function is used to match numeric columns within a prespecified range.

In the following examples we:

  • filter rows where the hospital variable contains “Hospital”
  • filter rows where the outcome is “Recover” or “Death”
  • filter rows in the age range 40-60

linelist[hospital %like% "Hospital"] #filter rows where the hospital variable contains “Hospital”
linelist[outcome %chin% c("Recover", "Death")] #filter rows where the outcome is “Recover” or “Death”
linelist[age %between% c(40, 60)] #filter rows in the age range 40-60

#%between% must take a vector of length 2, whereas %chin% can take vectors of length >= 1
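The behaviour of the three helpers, including how they treat missing values and range limits, can be checked on a small invented data table. This is a sketch with made-up values, not the handbook linelist.

```r
library(data.table)

# Invented toy data
dt <- data.table(
  hospital = c("Port Hospital", "Other", "Central Hospital", "Clinic"),
  outcome  = c("Recover", "Death", NA, "Recover"),
  age      = c(35, 42, 61, 60)
)

like_rows    <- dt[hospital %like% "Hospital"]            # pattern match: 2 rows
chin_rows    <- dt[outcome %chin% c("Recover", "Death")]  # exact match, NA is dropped: 3 rows
between_rows <- dt[age %between% c(40, 60)]               # limits are included: ages 42 and 60

print(like_rows)
print(chin_rows)
print(between_rows)
```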

50.4 The j argument: selecting and computing on columns

Using the DT[i, j, by] structure, we can select columns using numbers or names. The j argument is second; therefore, the syntax DT[, j] is used. To facilitate computations on the j argument, the column is wrapped using either list() or .().

Selecting columns

The first example retrieves the first, third and fifth columns of the data table, the second example selects all columns except the gender, age, weight and height columns. The third example uses the list() wrap to select the case_id and outcome columns (the .() wrap works just as well).

linelist[ , c(1,3,5)]
linelist[ , -c("gender", "age", "wt_kg", "ht_cm")]
linelist[ , list(case_id, outcome)] #linelist[ , .(case_id, outcome)] works just as well

Computing on columns

By combining the i and j arguments it is possible to filter rows and compute on the columns. Using .N in the j argument also represents the total number of rows in the data table and can be useful to return the number of rows after row filtering.

In the following examples we:

  • Count the number of cases hospitalised more than 7 days after symptom onset
  • Calculate the mean age of the cases that died at the military hospital
  • Calculate the mean, median and standard deviation of the age of the cases that recovered at the central hospital

linelist[days_onset_hosp > 7 , .N]
## [1] 189
linelist[hospital %like% "Military" & outcome %chin% "Death", .(mean(age, na.rm = T))] #na.rm = T removes N/A values
##         V1
## 1: 15.9084
linelist[hospital == "Central Hospital" & outcome == "Recover", 
                 .(mean_age = mean(age, na.rm = T),
                   median_age = median(age, na.rm = T),
                   sd_age = sd(age, na.rm = T))] #this syntax does not use the helper functions but works just as well
##    mean_age median_age   sd_age
## 1: 16.85185         14 12.93857

Remember using the .() wrap in the j argument facilitates computation, returns a data table and allows for column naming.
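The difference between a bare expression and the .() wrap in j can be seen in this small sketch on an invented data table:

```r
library(data.table)

dt <- data.table(age = c(10, 20, 30))   # invented values

v  <- dt[, mean(age)]                   # bare expression: returns a plain numeric vector
d1 <- dt[, .(mean(age))]                # wrapped: returns a data table with default column name V1
d2 <- dt[, .(mean_age = mean(age))]     # wrapped and named: column is called mean_age

class(v)
names(d1)
names(d2)
```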

50.5 The by argument: computing by groups

The by argument is the third argument in the DT[i, j, by] structure. The by argument accepts both a character vector and the list() or .() syntax. Using the .() syntax in the by argument allows column renaming on the fly.

In the following examples we:

  • count the number of cases by hospital
  • in cases over 18 years old, calculate the mean height and weight of cases according to gender and whether they recovered or died
  • in cases hospitalised more than 7 days after symptom onset, count the number of cases according to the month they were admitted and the hospital they were admitted to

linelist[, .N, .(hospital)] #the number of cases by hospital
##                                hospital    N
## 1:                                Other  885
## 2:                              Missing 1469
## 3: St. Mark's Maternity Hospital (SMMH)  422
## 4:                        Port Hospital 1762
## 5:                    Military Hospital  896
## 6:                     Central Hospital  454
linelist[age > 18, .(mean_wt = mean(wt_kg, na.rm = T),
                             mean_ht = mean(ht_cm, na.rm = T)), .(gender, outcome)] #NAs represent the categories where the data is missing
##    gender outcome  mean_wt  mean_ht
## 1:      m Recover 71.90227 178.1977
## 2:      f   Death 63.27273 159.9448
## 3:      m   Death 71.61770 175.4726
## 4:      f    <NA> 64.49375 162.7875
## 5:      m    <NA> 72.65505 176.9686
## 6:      f Recover 62.86498 159.2996
## 7:   <NA> Recover 67.21429 175.2143
## 8:   <NA>   Death 69.16667 170.7917
## 9:   <NA>    <NA> 70.25000 175.5000
linelist[days_onset_hosp > 7, .N, .(month = month(date_hospitalisation), hospital)]
##     month                             hospital  N
##  1:     5                    Military Hospital  3
##  2:     6                        Port Hospital  4
##  3:     7                        Port Hospital  8
##  4:     8 St. Mark's Maternity Hospital (SMMH)  5
##  5:     8                    Military Hospital  9
##  6:     8                                Other 10
##  7:     8                        Port Hospital 10
##  8:     9                        Port Hospital 28
##  9:     9                              Missing 27
## 10:     9                     Central Hospital 10
## 11:     9 St. Mark's Maternity Hospital (SMMH)  6
## 12:    10                              Missing  2
## 13:    10                    Military Hospital  3
## 14:     3                        Port Hospital  1
## 15:     4                    Military Hospital  1
## 16:     5                                Other  2
## 17:     5                     Central Hospital  1
## 18:     5                              Missing  1
## 19:     6                              Missing  7
## 20:     6 St. Mark's Maternity Hospital (SMMH)  2
## 21:     6                    Military Hospital  1
## 22:     7                    Military Hospital  3
## 23:     7                                Other  1
## 24:     7                              Missing  2
## 25:     7 St. Mark's Maternity Hospital (SMMH)  1
## 26:     8                     Central Hospital  2
## 27:     8                              Missing  6
## 28:     9                                Other  9
## 29:     9                    Military Hospital 11
## 30:    10                        Port Hospital  3
## 31:    10                                Other  4
## 32:    10 St. Mark's Maternity Hospital (SMMH)  1
## 33:    10                     Central Hospital  1
## 34:    11                              Missing  2
## 35:    11                        Port Hospital  1
## 36:    12                        Port Hospital  1
##     month                             hospital  N

data.table also allows expressions to be chained, as follows:

linelist[, .N, .(hospital)][order(-N)][1:3] #1st counts the cases by hospital, 2nd sorts the counts in descending order, 3rd keeps the 3 hospitals with the largest caseload
##             hospital    N
## 1:     Port Hospital 1762
## 2:           Missing 1469
## 3: Military Hospital  896
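
To make the chain easier to follow, here is a small sketch that runs the same three steps one at a time and then as a single chained expression. The toy data table and the intermediate object names are hypothetical.

```r
library(data.table)

# Hypothetical toy data: one row per case
toy <- data.table(hospital = c("A", "B", "B", "C", "C", "C", "D"))

# Step by step, storing each intermediate result
counts     <- toy[, .N, by = hospital]  # count cases by hospital
sorted     <- counts[order(-N)]         # sort the counts in descending order
top2_steps <- sorted[1:2]               # keep the first two rows

# The same operations chained: each [] acts on the result of the previous one
top2_chain <- toy[, .N, by = hospital][order(-N)][1:2]

print(top2_chain)
```

Chaining avoids creating intermediate objects, at the cost of long lines; breaking the chain after a closing bracket can help readability.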

In these examples we assume that each row of the data table represents one case, so .N (the number of rows, overall or within each group) gives the number of cases. When that assumption may not hold, another useful function is uniqueN(), which returns the number of unique values in a given input. This is illustrated here:

linelist[, .(uniqueN(gender))] #remember .() in the j argument returns a data table
##    V1
## 1:  3

The answer is 3, as the unique values in the gender column are m, f and NA (by default, uniqueN() counts NA as a value). Compare with the base R function unique(), which returns all the unique values in a given input:

linelist[, .(unique(gender))]
##      V1
## 1:    m
## 2:    f
## 3: <NA>

To find the number of unique cases in a given month we would write the following:

linelist[, .(uniqueN(case_id)), .(month = month(date_hospitalisation))]
##     month   V1
##  1:     5   62
##  2:     6  100
##  3:     7  198
##  4:     8  509
##  5:     9 1170
##  6:    10 1228
##  7:    11  813
##  8:    12  576
##  9:     1  434
## 10:     2  310
## 11:     3  290
## 12:     4  198
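
The difference between .N and uniqueN() only shows up when rows are duplicated or values are missing. The sketch below uses a small made-up data table (toy is hypothetical) in which one case appears twice and one gender is missing. It also uses the na.rm argument of uniqueN() to leave NA out of the count.

```r
library(data.table)

# Hypothetical toy data: case "c2" is recorded twice and one gender is missing
toy <- data.table(
  case_id = c("c1", "c2", "c2", "c3"),
  gender  = c("m", "f", "f", NA)
)

n_rows       <- toy[, .N]                             # number of rows
n_cases      <- toy[, uniqueN(case_id)]               # number of distinct case IDs
n_gender     <- toy[, uniqueN(gender)]                # distinct values, NA included
n_gender_obs <- toy[, uniqueN(gender, na.rm = TRUE)]  # distinct values, NA excluded

print(c(rows = n_rows, cases = n_cases,
        gender = n_gender, gender_observed = n_gender_obs))
```

Here .N would overstate the number of cases, because it counts the duplicated row as well.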

50.6 Adding to and updating data tables

The := operator is used to add or update data in a data table. Adding columns to your data table can be done in the following ways:

linelist[, adult := age >= 18] #adds one column
linelist[, c("child", "wt_lbs") := .(age < 18, wt_kg*2.204)] #adding multiple columns requires a character vector of names, c(""), on the left and list() or .() on the right
linelist[, `:=` (bmi_in_range = (bmi > 16 & bmi < 40),
                 no_infector_source_data = is.na(infector) | is.na(source))] #this method uses the functional form of the operator, `:=`()
linelist[, adult := NULL] #deletes the column
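
The examples above add and delete columns. The sketch below shows the other use of :=, updating existing values, and combines it with the i argument so that only selected rows change. Because := modifies a data table in place (by reference), the sketch first makes an independent copy with copy(). The toy data and the recoding of empty hospital names to "Missing" are illustrative assumptions only.

```r
library(data.table)

# Hypothetical toy data, used only for illustration
toy <- data.table(
  case_id  = c("c1", "c2", "c3"),
  age      = c(5, 30, NA),
  hospital = c("Port", "", "Central")
)

# := changes the data table in place, so work on a copy to keep toy untouched
toy2 <- copy(toy)

# Update an existing column, only in the rows selected by i
toy2[hospital == "", hospital := "Missing"]

# Add a new column; rows with a missing age get NA
toy2[, adult := age >= 18]

print(toy)   # original is unchanged
print(toy2)  # updated copy
```

Note that a plain assignment such as toy2 <- toy would not create a separate copy, so later := calls on toy2 would also change toy.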

More complex aggregations are beyond the scope of this introductory chapter. The aim here is to present data.table as a popular and viable alternative to dplyr for grouping and cleaning data, one that allows for neat and readable code.